Introduction


This is a course on discrete mathematics as used in Computer Science. It’s 
only a one-semester course, so there are a lot of topics that it doesn’t cover 
or doesn’t cover in much depth. But the hope is that this will give you a 
foundation of skills that you can build on as you need to, and particularly 
to give you a bit of mathematical maturity—the basic understanding of 
what mathematics is and how mathematical definitions and proofs work. 


1.1 So why do I need to learn all this nasty mathematics?


Why you should know about mathematics, if you are interested in Computer 
Science: or, more specifically, why you should take CS202 or a comparable 
course: 


• Computation is something that you can’t see and can’t touch, and yet
(thanks to the efforts of generations of hardware engineers) it obeys 
strict, well-defined rules with astonishing accuracy over long periods of 
time. 


• Computations are too big for you to comprehend all at once. Imagine
printing out an execution trace that showed every operation a typical 
$500 desktop computer executed in one (1) second. If you could read 
one operation per second, for eight hours every day, you would die 
of old age before you got halfway through. Now imagine letting the 
computer run overnight. 


So in order to understand computations, we need a language that allows 
us to reason about things we can’t see and can’t touch, that are too big 
for us to understand, but that nonetheless follow strict, simple, well-defined
rules. We’d like our reasoning to be consistent: any two people using the 
language should (barring errors) obtain the same conclusions from the same 
information. Computer scientists are good at inventing languages, so we 
could invent a new one for this particular purpose, but we don’t have to: 
the exact same problem has been vexing philosophers, theologians, and 
mathematicians for much longer than computers have been around, and 
they’ve had a lot of time to think about how to make such a language work. 
Philosophers and theologians are still working on the consistency part, but 
mathematicians (mostly) got it in the early 20th century. Because the first
virtue of a computer scientist is laziness, we are going to steal their code. 


1.2 But isn’t math hard?


Yes and no. The human brain is not really designed to do formal mathematical 
reasoning, which is why most mathematics was invented in the last few 
centuries and why even apparently simple things like learning how to count 
or add require years of training, usually done at an early age so the pain 
will be forgotten later. But mathematical reasoning is very close to legal 
reasoning, which we do seem to be very good at.¹

There is very little structural difference between the two sentences: 


1. If x is in S, then x + 1 is in S.
2. If x is of royal blood, then x’s child is of royal blood.


But because the first is about boring numbers and the second is about 
fascinating social relationships and rules, most people have a much easier 
time deducing that to show somebody is royal we need to start with some 
known royal and follow a chain of descendants than they have deducing that 
to show that some number is in the set S we need to start with some known 
element of S and show that repeatedly adding 1 gets us to the number we 
want. And yet to a logician these are the same processes of reasoning. 

So why is statement (1) trickier to think about than statement (2)? Part 
of the difference is familiarity—we are all taught from an early age what it 
means to be somebody’s child, to take on a particular social role, etc. For 
mathematical concepts, this familiarity comes with exposure and practice, 
just as with learning any other language. But part of the difference is that 


¹For a description of some classic experiments that demonstrate this, see
http://en.wikipedia.org/wiki/Wason_selection_task.

we humans are wired to understand and appreciate social and legal rules:
we are very good at figuring out the implications of a (hypothetical) rule 
that says that any contract to sell a good to a consumer for $100 or more 
can be canceled by the consumer within 72 hours of signing it provided the 
good has not yet been delivered, but we are not so good at figuring out the 
implications of a rule that says that a number is composite if and only if it 
is the product of two integer factors neither of which is 1. It’s a lot easier to 
imagine having to cancel a contract to buy swampland in Florida that you 
signed last night while drunk than having to prove that 82 is composite. But 
again: there is nothing more natural about contracts than about numbers, 
and if anything the conditions for our contract to be breakable are more 
complicated than the conditions for a number to be composite. 


1.3 Thinking about math with your heart


There are two things you need to be able to do to get good at mathematics 
(the creative kind that involves writing proofs, not the mechanical kind that 
involves grinding out answers according to formulas). One of them is to learn 
the language: to attain what mathematicians call mathematical maturity. 
You'll do that in CS202, if you pay attention. But the other is to learn 
how to activate the parts of your brain that are good at mathematical-style 
reasoning when you do math—the parts evolved to detect when the other 
primates in your band of hunter-gatherers are cheating. 

To do this it helps to get a little angry, and imagine that finishing a proof 
or unraveling a definition is the only thing that will stop your worst enemy 
from taking some valuable prize that you deserve. (If you don’t have a worst 
enemy, there is always the universal quantifier.) But whatever motivation 
you choose, you need to be fully engaged in what you are doing. Your brain 
is smart enough to know when you don’t care about something, and if you 
don’t believe that thinking about math is important, it will think about 
something else. 


1.4 What you should know about math 


We won’t be able to cover all of this, but the list below might be a minimal 
set of topics it would be helpful to understand for computer science. Topics 
that we didn’t do this semester are marked with (*). 

1.4.1 Foundations and logic


Why: This is the assembly language of mathematics—the stuff at the bottom 
that everything else compiles to. 


• Propositional logic.

• Predicate logic.

• Axioms, theories, and models.

• Proofs.

• Induction and recursion.


1.4.2 Basic mathematics on the real numbers 


Why: You need to be able to understand, write, and prove equations and 
inequalities involving real numbers. 


• Standard functions and their properties: addition, multiplication,
exponentiation, logarithms.

• More specialized functions that come up in algorithm analysis: floor,
ceiling, max, min.

• Techniques for proving inequalities, including:

  – General inequality axioms (transitivity, anti-symmetry, etc.)
  – Inequality axioms for ℝ (i.e., how < interacts with addition,
    multiplication, etc.)
  – Techniques involving derivatives (assumes calculus) (*):
    * Finding local extrema of f by solving for f′(x) = 0. (*)
    * Using f″ to distinguish local minima from local maxima. (*)
    * Using f′(x) < g′(x) in [a,b] and f(a) < g(a) or f(b) < g(b)
      to show f(x) < g(x) in [a,b]. (*)

• Special subsets of the real numbers: rationals, integers, natural numbers.

1.4.3 Fundamental mathematical objects


Why: These are the mathematical equivalent of data structures, the way 
that more complex objects are represented. 


• Set theory.

  – Naive set theory.
  – Predicates vs sets.
  – Set operations.
  – Set comprehension.
  – Russell’s paradox and axiomatic set theory.

• Functions.

  – Functions as sets.
  – Injections, surjections, and bijections.
  – Cardinality.
  – Finite vs infinite sets.
  – Sequences.

• Relations.

  – Equivalence relations. Equivalence classes and quotients.
  – Orders: total orders, partial orders, lattices, and well orders. Order
    types and ordinals.

• The basic number tower.

  – Countable universes: ℕ, ℤ, ℚ. (Can be represented in a computer.)
  – Uncountable universes: ℝ, ℂ. (Can only be approximated in a
    computer.)

• Other algebras.

  – The string monoid. (*)
  – ℤ_m and ℤ_p.
  – Polynomials over various rings and fields.

1.4.4 Modular arithmetic and polynomials


Why: Basis of modern cryptography.


• Arithmetic in ℤ_m.

• Primes and divisibility.

• Euclid’s algorithm and inverses.

• The Chinese Remainder Theorem.

• Fermat’s Little Theorem and Euler’s Theorem.

• RSA encryption.

• Galois fields and applications.


1.4.5 Linear algebra 


Why: Shows up everywhere. 


• Vectors and matrices.

• Matrix operations and matrix algebra.

• Inverse matrices and Gaussian elimination.

• Geometric interpretations.


1.4.6 Graphs


Why: Good for modeling interactions. Basic tool for algorithm design.


• Definitions: graphs, digraphs, multigraphs, etc.

• Paths, connected components, and strongly-connected components.

• Special kinds of graphs: paths, cycles, trees, cliques, bipartite graphs.

• Subgraphs, induced subgraphs, minors.

1.4.7 Counting


Why: Basic tool for knowing what resources your program is going to 
consume. 


• Basic combinatorial counting: sums, products, exponents, differences,
and quotients.

• Combinatorial functions.

  – Factorials.
  – Binomial coefficients.
  – The 12-fold way. (*)

• Advanced counting techniques.

  – Inclusion-exclusion.
  – Recurrences. (*)
  – Generating functions. (Limited coverage.)


1.4.8 Probability 


Why: Can’t understand randomized algorithms or average-case analysis 
without it. Handy if you go to Vegas. 


• Discrete probability spaces.

• Events.

• Independence.

• Random variables.

• Expectation and variance.

• Probabilistic inequalities.

  – Markov’s inequality.
  – Chebyshev’s inequality. (*)
  – Chernoff bounds. (*)

• Stochastic processes. (*)

  – Markov chains. (*)
  – Martingales. (*)
  – Branching processes. (*)

1.4.9 Tools


Why: Basic computational stuff that comes up, but doesn’t fit in any of the 
broad categories above. These topics will probably end up being mixed in 
with the topics above. 


• Things you may have forgotten about exponents and logarithms. (*)

• Inequalities and approximations.

• ∑ and ∏ notation.

• How to differentiate and integrate simple functions. (*)

• Computing or approximating the value of a sum.

• Asymptotic notation.


Chapter 2 


Mathematical logic 


Mathematical logic is the discipline that mathematicians invented in the late 
nineteenth and early twentieth centuries so they could stop talking nonsense. 
It’s the most powerful tool we have for reasoning about things that we can’t 
really comprehend, which makes it a perfect tool for Computer Science. 


2.1 The basic picture 


Reality               Model                     Theory

herds of sheep
piles of rocks    →   ℕ = {0, 1, 2, ...}    ←   ∀x : ∃y : y = x + 1
tally marks


We want to model something we see in reality with something we can fit 
in our heads. Ideally we drop most of the features of the real thing that we 
don’t care about and keep the parts that we do care about. But there is a 
second problem: if our model is very big (and the natural numbers are very 
very big), how do we know what we can say about them? 


2.1.1 Axioms, models, and inference rules 


One approach is to come up with a list of axioms that are true statements 
about the model and a list of inference rules that let us derive new true 
statements from the axioms. The axioms and inference rules together generate 
a theory that consists of all statements that can be constructed from the 
axioms by applying the inference rules. The rules of the game are that we 
can’t claim that some statement is true unless it’s a theorem: something 
we can derive as part of the theory. 

Simple example: All fish are green (axiom). George Washington is a
fish (axiom). From “all X are Y” and “Z is X”, we can derive “Z is Y” 
(inference rule). Thus George Washington is green (theorem). Since we can’t 
do anything else with our two axioms and one inference rule, these three 
statements together form our entire theory about George Washington, fish, 
and greenness. 
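
The fish example is small enough to run mechanically. The sketch below is only an illustration: the tuple encoding of statements and the names AXIOMS, apply_rule, and theory are assumptions made for this example, not notation used elsewhere in these notes. It applies the single inference rule repeatedly until nothing new can be derived.

```python
# Statements are encoded as tuples (an encoding chosen only for this sketch):
#   ("all", X, Y) means "all X are Y"
#   ("is", Z, X)  means "Z is X"
AXIOMS = {
    ("all", "fish", "green"),
    ("is", "George Washington", "fish"),
}


def apply_rule(statements):
    """From ("all", X, Y) and ("is", Z, X), derive ("is", Z, Y)."""
    derived = set()
    for kind1, x, y in statements:
        if kind1 != "all":
            continue
        for kind2, z, x2 in statements:
            if kind2 == "is" and x2 == x:
                derived.add(("is", z, y))
    return derived


def theory(axioms):
    """Close the axioms under the inference rule."""
    known = set(axioms)
    while True:
        new = apply_rule(known) - known
        if not new:
            return known
        known |= new


THEORY = theory(AXIOMS)
for statement in sorted(THEORY):
    print(statement)
```

The closure contains exactly three statements: the two axioms and the theorem that George Washington is green.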

Theories are attempts to describe models. A model is typically a 
collection of objects and relations between them. For a given theory, there 
may be many models that are consistent with it: for example, a model that 
includes both green fishy George Washington and MC 900-foot Abraham 
Lincoln is consistent with the theory above, because the theory doesn’t say 
anything about Abraham Lincoln. 


2.1.2 Consistency 


A theory is consistent if it can’t prove both P and not-P for any P. 
Consistency is incredibly important, since all the logics people actually use 
can prove anything if you start with P and not-P. 


2.1.3 What can go wrong


If we throw in too many axioms, we can get an inconsistency: “All fish are
green; all sharks are not green; all sharks are fish; George Washington is a 
shark” gets us into trouble pretty fast. 

If we don’t throw in enough axioms, we underconstrain the model. For 
example, the Peano axioms for the natural numbers (see example below) say 
(among other things) that there is a number 0 and that any number x has a 
successor S(x) (think of S(x) as x +1). If we stop there, we might have a 
model that contains only 0, with S(0) = 0. If we add in 0 ≠ S(x) for any
x, then we can get stuck at S(0) = 1 = S(1). If we add yet another axiom
that says S(x) = S(y) if and only if x = y, then we get all the ordinary
natural numbers 0, S(0) = 1, S(1) = 2, etc., but we could also get some
extras: say 0′, S(0′) = 1′, S(1′) = 0′. Characterizing the “correct” natural
numbers historically took a lot of work to get right, even though we all know 
what we mean when we talk about them. The situation is of course worse 
when we are dealing with objects that we don’t really understand; here the 
most we can hope for is to try out some axioms and see if anything strange 
happens. 
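
The unintended models described above can be checked directly. In this sketch a candidate model is a Python set together with a dictionary for S; the function names and the string labels "0'" and "1'" are assumptions made for illustration. Since we cannot enumerate all of the natural numbers, the third check looks only at the two extra elements, which is enough to see that none of the three axioms rules them out.

```python
def has_successors(universe, succ):
    """Every x has a successor S(x) that is also in the universe."""
    return all(succ[x] in universe for x in universe)


def zero_is_not_a_successor(universe, succ, zero=0):
    """0 != S(x) for every x."""
    return all(succ[x] != zero for x in universe)


def successor_is_injective(universe, succ):
    """S(x) = S(y) if and only if x = y."""
    return all((succ[x] == succ[y]) == (x == y)
               for x in universe for y in universe)


def check(universe, succ):
    """Report which of the three axioms hold, in the order introduced."""
    return (has_successors(universe, succ),
            zero_is_not_a_successor(universe, succ),
            successor_is_injective(universe, succ))


# Only 0, with S(0) = 0: allowed if all we require is that successors exist.
only_zero = ({0}, {0: 0})

# Stuck at S(0) = 1 = S(1): allowed until we require S to be injective.
stuck = ({0, 1}, {0: 1, 1: 1})

# The extras 0' and 1' on their own: they break none of the three axioms.
extras = ({"0'", "1'"}, {"0'": "1'", "1'": "0'"})

for name, model in [("only_zero", only_zero), ("stuck", stuck), ("extras", extras)]:
    print(name, check(*model))
```

Each model fails exactly the axiom that was added to exclude it, except for the extras, which pass all three.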

Better yet is to use some canned axioms somebody else has already 
debugged for us. In this respect the core of mathematics acts like a system 
library: it’s a collection of useful structures and objects that are known to
work, and (if we are lucky) may even do exactly what we expect. 


2.1.4 The language of logic 


The basis of mathematical logic is propositional logic, which was mostly 
invented in ancient Greece.¹ Here the model is a collection of statements
that are either true or false. There is no ability to refer to actual things; 
though we might include the statement “George Washington is a fish”, from 
the point of view of propositional logic that is an indivisible atomic chunk of 
truth or falsehood that says nothing in particular about George Washington 
or fish. If we treat it as an axiom we can prove the truth of more complicated 
statements like “George Washington is a fish or 2+2=5” (true since the first 
part is true), but we can’t really deduce much else. Still, this is a starting 
point. 

If we want to talk about things and their properties, we must upgrade 
to predicate logic. Predicate logic adds both constants (stand-ins for 
objects in the model like “George Washington”) and predicates (stand-ins 
for properties like “is a fish”). It also lets us quantify over variables and 
make universal statements like “For all x, if x is a fish then x is green.” As 
a bonus, we usually get functions (“f(x) = the number of books George 
Washington owns about x”) and equality (“George Washington = 12” implies 
“George Washington + 5 = 17”). This is enough machinery to define and do 
pretty much all of modern mathematics. 

We will discuss both of these logics in more detail below. 


2.1.5 Standard axiom systems and models 


Rather than define our own axiom systems and models from scratch, it helps 
to use ones that already have a track record of consistency and usefulness. 
Almost all mathematics fits in one of the following models: 


• The natural numbers ℕ. These are defined using the Peano axioms,
and if all you want to do is count, add, and multiply, you don’t need
much else. (If you want to subtract, things get messy.)

• The integers ℤ. Like the naturals, only now we can subtract. Division
is still a problem.


¹See https://plato.stanford.edu/entries/logic-ancient/ for a nuanced and
detailed explanation of the actual history. I would like to thank Nick Halme for pointing me
to this resource after observing some deficiencies in the version of the story previously told
in these notes.




• The rational numbers ℚ. Now we can divide. But what about √2?

• The real numbers ℝ. Now we have √2. But what about √(−1)?

• The complex numbers ℂ. Now we are pretty much done. But what if
we want to talk about more than one complex number at a time?

• The universe of sets. These are defined using the axioms of set theory,
and produce a rich collection of sets that include, among other
things, structures equivalent to the natural numbers, the real numbers,
collections of same, sets so big that we can’t even begin to imagine
what they look like, and even bigger sets so big that we can’t use the
usual accepted system of axioms to prove whether they exist or not.
Fortunately, in computer science we can mostly stop with finite sets,
which makes life less confusing.

• Various alternatives to set theory, like lambda calculus, category theory,
or second-order arithmetic. We won’t talk about these, since they
generally don’t let you do anything you can’t do already with sets.
However, lambda calculus and category theory are both important to
know about if you are interested in programming language theory.


In practice, the usual way to do things is to start with sets and then define 
everything else in terms of sets: e.g., 0 is the empty set, 1 is a particular set 
with 1 element, 2 a set with 2 elements, etc., and from here we work our way 
up to the fancier numbers. The idea is that if we trust our axioms for sets 
to be consistent, then the things we construct on top of them should also be 
consistent, although if we are not careful in our definitions they may not be 
exactly the things we think they are. 
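
As a rough illustration of building numbers out of sets, the sketch below uses one common encoding, the von Neumann construction, in which each number is the set of all smaller numbers. Choosing this particular encoding is an assumption of the example (the paragraph above does not say which sets are meant), and successor is a name invented here.

```python
def successor(n):
    """Return n ∪ {n}, the von Neumann successor of the set n."""
    return n | frozenset([n])


zero = frozenset()        # 0 is the empty set
one = successor(zero)     # 1 = {0}
two = successor(one)      # 2 = {0, 1}
three = successor(two)    # 3 = {0, 1, 2}

# Each number, viewed as a set, has exactly that many elements.
print(len(zero), len(one), len(two), len(three))

# "Less than" becomes set membership in this encoding.
print(zero in two, one in two, two in two)
```

Python needs frozenset rather than set here because the elements of a set must be hashable.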


2.2 Propositional logic


Propositional logic is the simplest form of logic. Here the only statements
that are considered are propositions, which contain no variables. Because
propositions contain no variables, they are either always true or always false.

Examples of propositions:


• 2 + 2 = 4. (Always true.)

• 2 + 2 = 5. (Always false.)


Examples of non-propositions: 
• x + 2 = 4. (May be true, may not be true; it depends on the value of
x.)

• x · 0 = 0. (Always true, but it’s still not a proposition because of the
variable.)

• x · 0 = 1. (Always false, but not a proposition because of the variable.)


As the last two examples show, it is not enough for a statement to be 
always true or always false—whether a statement is a proposition or not is 
a structural property. But if a statement doesn’t contain any variables (or 
other undefined terms), it is a proposition, and as a side-effect of being a 
proposition it’s always true or always false. 


2.2.1 Operations on propositions 


Propositions by themselves are pretty boring. So boring, in fact, that 
logicians quickly stop talking about specific propositions and instead haul 
out placeholder names like p, q, or r. But we can build slightly more
interesting propositions by combining propositions together using various 
logical connectives, such as: 


Negation The negation of p is written as ¬p, or sometimes ∼p, −p, or p̄.
It has the property that it is false when p is true, and true when p is 
false. 


Or The or of two propositions p and q is written as p ∨ q, and is true as
long as at least one, or possibly both, of p and q is true.² This is not
always the same as what “or” means in English; in English, “or” often 
is used for exclusive or which is not true if both p and q are true. For 
example, if someone says “You will give me all your money or I will 
stab you with this table knife”, you would be justifiably upset if you 
turn over all your money and still get stabbed. But a logician would 
not be at all surprised, because the standard “or” in propositional logic 
is an inclusive or that allows for both outcomes. 


Exclusive or If you want to exclude the possibility that both p and q are
true, you can use exclusive or instead. This is written as p ⊕ q, and
is true precisely when exactly one of p or q is true. Exclusive or is
not used in classical logic much, but is important for many computing
applications, since it corresponds to addition modulo 2 (see §8.3) and
has nice reversibility properties (e.g. p ⊕ (p ⊕ q) always has the same
truth-value as q).


²The symbol ∨ is a stylized V, intended to represent the Latin word vel, meaning “or.”
(Thanks to Noel McDermott for remembering this.) Much of this notation is actually pretty
recent (early 20th century): see http://jeff560.tripod.com/set.html for a summary of
earliest uses of each symbol.


And The and of p and q is written as p ∧ q, and is true only when both p
and q are true.³ This is pretty much the same as in English, where “I
like to eat ice cream and I own a private Caribbean island” is not a
true statement when made by most people even though most people
like to eat ice cream. The only complication in translating English
expressions into logical ands is that logicians can’t tell the difference
between “and” and “but”: the statement “2 + 2 = 4 but 3 + 3 = 6”
becomes simply “(2 + 2 = 4) ∧ (3 + 3 = 6).”


Implication This is the most important connective for proofs. An implication
represents an “if...then” claim. If p implies q, then we write
p → q or p ⇒ q, depending on our typographic convention and the
availability of arrow symbols in our favorite font. In English, p → q
is usually rendered as “If p, then q,” as in “If you step on your own
head, it will hurt.” The meaning of p → q is that q is true whenever
p is true, and the proposition p → q is true provided (a) p is false (in
which case all bets are off), or (b) q is true.


In fact, the only way for p → q to be false is for p to be true but q to
be false. Because of this, p → q can be rewritten as ¬p ∨ q. So, for
example, the statements “If 2 + 2 = 5, then I’m the Pope”, “If I’m the
Pope, then 2 + 2 = 4”, and “If 2 + 2 = 4, then 3 + 3 = 6”, are all true,
provided the if/then is interpreted as implication.


Normal English usage does not always match this pattern. Instead, 
if/then in normal speech is often interpreted as the much stronger 
biconditional (see below), and often carries connotations of causality. 
So if I say—entirely truthfully—“If the moon is made of green cheese, 
then the world will end at midnight,” my listeners will think I have 
some mechanism in mind by which a green-cheese moon will end the 
world. But all I am doing is taking advantage of my knowledge that 
the moon is not made of green cheese to make a statement that is 
trivially true, because it has a false premise. This is another example 
of how the language of logic strips away the vast cloud of secondary 


³The symbol ∧ is a stylized A, short for the Latin word atque, meaning “and also.”


NOT p                 ¬p        p̄, ∼p
p AND q               p ∧ q
p XOR q               p ⊕ q
p OR q                p ∨ q
p implies q           p → q     p ⇒ q, p ⊃ q
p if and only if q    p ↔ q     p ⇔ q


Table 2.1: Compound propositions. The rightmost column gives alternate
forms. Precedence goes from strongest for ¬ to weakest for ↔ (but see
§2.2.1.1 for a discussion of variation in conventions for this).


messages and hidden assumptions carried by ordinary speech, perhaps 
the most important of which is the assumption that if I say something, 
it should mean something, and not just be a formal exercise in symbol 
manipulation. 


Biconditional Suppose that p → q and q → p, so that either both p and
q are true or both p and q are false. In this case, we write p ↔ q
or p ⇔ q, and say that p holds if and only if q holds. The truth
of p ↔ q is still just a function of the truth or falsehood of p and q;
though there doesn’t need to be any connection between the two sides
of the statement, “2 + 2 = 5 if and only if I am the Pope” is a true
statement (provided it is not uttered by the Pope). The only way for
p ↔ q to be false is for one side to be true and one side to be false.


The result of applying any of these operations is called a compound 
proposition. 
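
For readers who think more easily in code, the connectives above can be sketched as functions on Python booleans. The function names are chosen for this illustration only. The two checks at the end confirm claims made in the text: that p ⊕ (p ⊕ q) always has the same truth value as q, and that p → q is false only when p is true and q is false.

```python
from itertools import product

BOOLS = (False, True)


def NOT(p):
    return not p


def OR(p, q):
    return p or q          # inclusive: also true when both are true


def XOR(p, q):
    return p != q          # true when exactly one is true


def AND(p, q):
    return p and q


def IMPLIES(p, q):
    return (not p) or q    # false only when p is true and q is false


def IFF(p, q):
    return p == q


# Reversibility of exclusive or: p XOR (p XOR q) has the same value as q.
xor_reversible = all(XOR(p, XOR(p, q)) == q
                     for p, q in product(BOOLS, repeat=2))

# The rows (p, q) in which p -> q is false.
false_rows = [(p, q) for p, q in product(BOOLS, repeat=2)
              if not IMPLIES(p, q)]

print(xor_reversible, false_rows)
```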

Table 2.1 shows what all of this looks like when typeset nicely. Note that 
in some cases there is more than one way to write a compound expression. 
Which you choose is a matter of personal preference, but you should try to 
be consistent. 


2.2.1.1 Precedence 


The short version: for the purposes of this course, we will use the ordering in 
Table 2.1, which corresponds roughly to precedence in C-like programming 
languages. But see caveats below. Remember always that there is no shame 
in putting in a few extra parentheses if it makes a formula more clear. 
Examples: (¬p ∨ q ∧ r → s ↔ t) is interpreted as ((((¬p) ∨ (q ∧ r)) →
s) ↔ t). Both OR and AND are associative, so (p ∨ q ∨ r) is the same as
((p ∨ q) ∨ r) and as (p ∨ (q ∨ r)), and similarly (p ∧ q ∧ r) is the same as
((p ∧ q) ∧ r) and as (p ∧ (q ∧ r)).

Note that this convention is not universal: many mathematicians give 
AND and OR equal precedence, so that the meaning of p ∧ q ∨ r is ambiguous
without parentheses. There are good arguments for either convention.
Making AND have higher precedence than OR is analogous to giving multiplication
higher precedence than addition, and makes sense visually when
AND is written multiplicatively (as in pq ∨ qr for (p ∧ q) ∨ (q ∧ r)). Making
them have the same precedence emphasizes the symmetry between the two 
operations, which we’ll see more about later when we talk about De Morgan’s 
laws in §2.2.3. But as with anything else in mathematics, either convention 
can be adopted, as long as you are clear about what you are doing and it 
doesn’t cause annoyance to the particular community you are writing for. 

There does not seem to be a standard convention for the precedence of 
XOR, since logicians don’t use it much. There are plausible arguments for 
putting XOR in between AND and OR, but it’s probably safest just to use 
parentheses. 

Implication is not associative, although the convention is that it binds 
“to the right,” so that a → b → c is read as a → (b → c). Except for
type theorists and Haskell programmers, few people ever remember this,
so it is usually safest to put in the parentheses. I personally have no idea
what p ↔ q ↔ r means, so any expression like this should be written with
parentheses as either (p ↔ q) ↔ r or p ↔ (q ↔ r).
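
Python's own not, and, and or follow the same relative ordering as ¬, ∧, and ∨ in Table 2.1, so we can ask the Python parser how it groups an unparenthesized expression. This is only an analogy: Python has no implication or biconditional operator, and the helper same_parse is a name invented for this sketch.

```python
import ast
from itertools import product


def same_parse(a, b):
    """True if the two expressions produce the same syntax tree."""
    return (ast.dump(ast.parse(a, mode="eval"))
            == ast.dump(ast.parse(b, mode="eval")))


# NOT binds tightest, then AND, then OR.
default_grouping = same_parse("not p or q and r", "(not p) or (q and r)")
other_grouping = same_parse("not p or q and r", "((not p) or q) and r")
print(default_grouping, other_grouping)


# The two groupings are genuinely different formulas: list the assignments
# (p, q, r) on which they disagree.
def and_first(p, q, r):
    return (not p) or (q and r)


def or_first(p, q, r):
    return ((not p) or q) and r


disagreements = [v for v in product((False, True), repeat=3)
                 if and_first(*v) != or_first(*v)]
print(disagreements)
```

The fact that the list of disagreements is not empty is the practical argument for adding parentheses whenever a reader might assume a different convention.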


2.2.2 Truth tables 


To define logical operations formally, we give a truth table. This gives, for 
any combination of truth values (true or false, which as computer scientists 
we often write as 1 or 0) of the inputs, the truth value of the output. In this 
usage, truth tables are to logic what addition and multiplication tables are 
to arithmetic. 

Here is a truth table for negation: 


p   ¬p
0   1
1   0

And here is a truth table for the rest of the logical operators:


p q   p ∨ q   p ⊕ q   p ∧ q   p → q   p ↔ q
0 0     0       0       0       1       1
0 1     1       1       0       1       0
1 0     1       1       0       0       0
1 1     1       0       1       1       1


See also [Fer08, §1.1], [Ros12, §§1.1-1.2], or [Big02, §§3.1-3.3]. 

We can think of each row of a truth table as a model for propositional 
logic, since the only things we can describe in propositional logic are whether 
particular propositions are true or not. Constructing a truth table corre- 
sponds to generating all possible models. 

This can be useful if we want to figure out when a particular proposition 
is true. Proving a proposition using a truth table is a simple version of 
model checking: we enumerate all possible models of a given collection 
of simple propositions, and see if what we want to prove holds in all models. 
This works for propositional logic because the list of models is just the list 
of possible combinations of truth values for all the simple propositions P, 
Q, etc. We can check that each truth table we construct works by checking 
that the truth values in each column (corresponding to some subexpression of
the thing we are trying to prove) follow from the truth values in previous 
columns according to the rules established by the truth table defining the 
appropriate logical operation. 
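
The model-checking idea described above fits in a few lines of code. In this sketch, a model is a dictionary assigning a truth value to each simple proposition; the names models and holds_in_all_models are assumptions of the example. It checks that (p ∧ (p → q)) → q holds in every model, while p → q by itself does not.

```python
from itertools import product


def models(names):
    """Yield every assignment of truth values to the given names."""
    for values in product((False, True), repeat=len(names)):
        yield dict(zip(names, values))


def holds_in_all_models(formula, names):
    """Evaluate formula in each model; succeed only if it is always true."""
    return all(formula(m) for m in models(names))


def implies(p, q):
    return (not p) or q


def modus_ponens(m):
    # (p AND (p -> q)) -> q
    return implies(m["p"] and implies(m["p"], m["q"]), m["q"])


def bare_implication(m):
    # p -> q, which fails in the model where p is true and q is false
    return implies(m["p"], m["q"])


print(holds_in_all_models(modus_ponens, ["p", "q"]))
print(holds_in_all_models(bare_implication, ["p", "q"]))
print(len(list(models(["p", "q", "r"]))))
```

With n simple propositions there are 2^n models, so this brute-force approach is only practical when n is small.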

For predicate logic, model checking becomes more complicated, because 
a typical system of axioms is likely to have infinitely many models, many of 
which are likely to be infinitely large. There we will need to rely much more 
on proofs constructed by applying inference rules. 


2.2.3 Tautologies and logical equivalence


A compound proposition that is true no matter what the truth-values of the 
propositions it contains is called a tautology. For example, p → p, p ∨ ¬p,
and ¬(p ∧ ¬p) are all tautologies, as can be verified by constructing truth
tables. If a compound proposition is always false, it’s a contradiction. The 
negation of a tautology is a contradiction and vice versa. 

The most useful class of tautologies are logical equivalences. A logical
equivalence is a tautology of the form X ↔ Y, where X and Y are compound
propositions. In this case, X and Y are said to be logically equivalent and we can
substitute one for the other in more complex propositions. We write X ≡ Y
if X and Y are logically equivalent.

The nice thing about logical equivalence is that it does the same thing for
Boolean formulas that equality does for algebraic formulas: if we know (for
example) that p ∨ ¬p is equivalent to 1, and q ∨ 1 is equivalent to 1, we can
grind q ∨ p ∨ ¬p ≡ q ∨ 1 ≡ 1 without having to do anything particularly clever.
(We will need cleverness later when we prove things where the consequent 
isn’t logically equivalent to the premise.) 

To prove a logical equivalence, one either constructs a truth table to show 
that X ↔ Y is a tautology, or transforms X to Y using previously-known
logical equivalences. 

Some examples: 


• p ∧ ¬p ≡ 0: Construct a truth table


p   ¬p   p ∧ ¬p   0
0   1      0      0
1   0      0      0

and observe that the last two columns are always equal. 


• p ∨ p ≡ p: Use the truth table


p   p ∨ p
0     0
1     1


• p → q ≡ ¬p ∨ q: Again, use a truth table


p q   p → q   ¬p ∨ q
0 0     1       1
0 1     1       1
1 0     0       0
1 1     1       1


• ¬(p ∨ q) ≡ ¬p ∧ ¬q (one of De Morgan’s laws; the other is ¬(p ∧ q) ≡
¬p ∨ ¬q).


p q   p ∨ q   ¬(p ∨ q)   ¬p   ¬q   ¬p ∧ ¬q
0 0     0        1       1    1       1
0 1     1        0       1    0       0
1 0     1        0       0    1       0
1 1     1        0       0    0       0

• p ∨ (q ∧ r) ≡ (p ∨ q) ∧ (p ∨ r) (one of the distributive laws; the other is
p ∧ (q ∨ r) ≡ (p ∧ q) ∨ (p ∧ r)).


p q r   q ∧ r   p ∨ (q ∧ r)   p ∨ q   p ∨ r   (p ∨ q) ∧ (p ∨ r)
0 0 0     0          0          0       0             0
0 0 1     0          0          0       1             0
0 1 0     0          0          1       0             0
0 1 1     1          1          1       1             1
1 0 0     0          1          1       1             1
1 0 1     0          1          1       1             1
1 1 0     0          1          1       1             1
1 1 1     1          1          1       1             1


• (p → r) ∨ (q → r) ≡ (p ∧ q) → r. Now things are getting messy,
so building a full truth table may take a while. But we can take a
shortcut by using logical equivalences that we’ve already proved (plus
associativity of ∨):


(p → r) ∨ (q → r) ≡ (¬p ∨ r) ∨ (¬q ∨ r)    [Using p → q ≡ ¬p ∨ q twice]
                  ≡ ¬p ∨ ¬q ∨ r ∨ r        [Associativity and commutativity of ∨]
                  ≡ ¬p ∨ ¬q ∨ r            [p ≡ p ∨ p]
                  ≡ ¬(p ∧ q) ∨ r           [De Morgan’s law]
                  ≡ (p ∧ q) → r.           [p → q ≡ ¬p ∨ q]


This last equivalence is a little surprising. It shows, for example, that 
if somebody says “It is either the case that if you study you will graduate 
from Yale with distinction, or that if you join the right secret society you 
will graduate from Yale with distinction”, then this statement (assuming 
we treat the or as ∨) is logically equivalent to “If you study and join the
right secret society, then you will graduate from Yale with distinction.” It is 
easy to get tangled up in trying to parse the first of these two propositions; 
translating to logical notation and simplifying using logical equivalence is a 
good way to simplify it. 

Over the years, logicians have given names to many logical equivalences. 
Some of the more useful ones are summarized in Table 2.2. More complicated 
equivalences can often be derived from these. If that doesn’t work (and you 
don’t have too many variables to work with), you can always try writing out 
a truth table.

¬¬p ≡ p                              Double negation
¬(p ∧ q) ≡ ¬p ∨ ¬q                   De Morgan’s law
¬(p ∨ q) ≡ ¬p ∧ ¬q                   De Morgan’s law
p ∧ q ≡ q ∧ p                        Commutativity of AND
p ∨ q ≡ q ∨ p                        Commutativity of OR
p ∧ (q ∧ r) ≡ (p ∧ q) ∧ r            Associativity of AND
p ∨ (q ∨ r) ≡ (p ∨ q) ∨ r            Associativity of OR
p ∧ (q ∨ r) ≡ (p ∧ q) ∨ (p ∧ r)      AND distributes over OR
p ∨ (q ∧ r) ≡ (p ∨ q) ∧ (p ∨ r)      OR distributes over AND
p → q ≡ ¬p ∨ q                       Equivalence of implication and OR
p → q ≡ ¬q → ¬p                      Contraposition
p ↔ q ≡ (p → q) ∧ (q → p)            Expansion of if and only if
p ↔ q ≡ ¬p ↔ ¬q                      Inverse of if and only if
p ↔ q ≡ q ↔ p                        Commutativity of if and only if

Table 2.2: Common logical equivalences (see also [Fer08, Theorem 1.1])

2.2.3.1 Inverses, converses, and contrapositives


The contrapositive of p → q is ¬q → ¬p; it is logically equivalent to the
original implication. For example, the contrapositive of “If I am human
then I am a mammal” is “If I am not a mammal then I am not human”. A
proof by contraposition demonstrates that p implies q by assuming ¬q
and then proving ¬p; it is similar but not identical to an indirect proof,
which assumes ¬p and derives a contradiction.

The inverse of p → q is ¬p → ¬q. So the inverse of “If you take CPSC
202, you will surely die” is “If you do not take CPSC 202, you will not surely 
die.” There is often no connection between the truth of an implication and 
the truth of its inverse: “If I am human then I am a mammal” does not 
have the same truth-value as “If I am not human then I am not a mammal,” 
barring some over-the-top ecological disaster. 

The converse of p → q is q → p. E.g., the converse of “If I am human
then I am a mammal” is “If I am a mammal then I am human.” The converse
of a statement is always logically equivalent to the inverse. Often in proving a
biconditional (e.g., “I am human if and only if I am a mammal”), one proceeds
by proving first the implication in one direction and then either the inverse 
or the converse, as either is logically equivalent to the implication in the 
other direction. 


2.2.3.2 Equivalences involving true and false 


Any tautology is equivalent to true; any contradiction is equivalent to false. 
Two important cases of this are the law of the excluded middle

P ∨ ¬P ≡ 1

and its dual, the law of non-contradiction

P ∧ ¬P ≡ 0.


The law of the excluded middle is what allows us to do case analysis, where
we prove that some proposition Q holds by showing first that P implies Q
and then that ¬P also implies Q.⁴


⁴Though we will use the law of the excluded middle, it has always been a little bit
controversial, because it is non-constructive: it tells you that one of P or ¬P is true,
but it doesn’t tell you which.

For this reason, some logicians adopt a variant of classical logic called intuitionistic
logic where the law of the excluded middle does not hold. Though this was originally
done for aesthetic reasons, it turns out that there is a deep connection between computer
programs and proofs in intuitionistic logic, known as the Curry-Howard isomorphism.
The idea is that you get intuitionistic logic if you interpret

• P as an object of type P;

• P → Q as a function that takes a P as an argument and returns a Q;

• P ∧ Q as an object that contains both a P and a Q (like a struct in C);

• P ∨ Q as an object that contains either a P or a Q (like a union in C); and

• ¬P as P → ⊥, a function that given a P produces a special error value ⊥ that can’t
  otherwise be generated.

With this interpretation, many theorems of classical logic continue to hold. For example,
modus ponens says

(P ∧ (P → Q)) → Q.

Seen through the Curry-Howard isomorphism, this means that there is a function that,
given a P and a function that generates a Q from a P, generates a Q. For example, the
following Scheme function does this:

(define (modus-ponens p p-implies-q) (p-implies-q p))

Similarly, in a sufficiently sophisticated programming language we can show P → ¬¬P,
since this expands to P → ((P → ⊥) → ⊥), and we can write a function that takes a P as
its argument and returns a function that takes a P → ⊥ function and feeds the P to it:

(define (double-negation p) (lambda (p-implies-fail)
    (p-implies-fail p)))

But we can’t generally show ¬¬P → P, since there is no way to take a function of type
(P → ⊥) → ⊥ and extract an actual example of a P from it. Nor can we expect to show
P ∨ ¬P, since this would require exhibiting either a P or a function that takes a P and
produces an error, and for any particular type P we may not be able to do either.

For normal mathematical proofs, we won’t bother with this, and will just assume P ∨ ¬P
always holds.

One strategy for simplifying logical expressions is to try to apply known
equivalences to generate sub-expressions that reduce to true or false via the
law of the excluded middle or the law of non-contradiction. These can then
be absorbed into nearby terms using various absorption laws, shown in
Table 2.3.

P ∧ 0 ≡ 0        P ∨ 0 ≡ P
P ∧ 1 ≡ P        P ∨ 1 ≡ 1
P ↔ 0 ≡ ¬P       P ⊕ 0 ≡ P
P ↔ 1 ≡ P        P ⊕ 1 ≡ ¬P
P → 0 ≡ ¬P       0 → P ≡ 1
P → 1 ≡ 1        1 → P ≡ P

Table 2.3: Absorption laws. The first four are the most important. Note
that ∧, ∨, ⊕, and ↔ are all commutative, so reversed variants also work.

Example: Let’s show that (P ∧ (P → Q)) → Q is a tautology. (This
justifies the inference rule modus ponens, defined below.) Working from the
inside out, we can compute


(P ∧ (P → Q)) → Q ≡ (P ∧ (¬P ∨ Q)) → Q          expand →
                  ≡ ((P ∧ ¬P) ∨ (P ∧ Q)) → Q    distribute ∧ over ∨
                  ≡ (0 ∨ (P ∧ Q)) → Q           non-contradiction
                  ≡ (P ∧ Q) → Q                 absorption
                  ≡ ¬(P ∧ Q) ∨ Q                expand →
                  ≡ (¬P ∨ ¬Q) ∨ Q               De Morgan’s law
                  ≡ ¬P ∨ (¬Q ∨ Q)               associativity
                  ≡ ¬P ∨ 1                      excluded middle
                  ≡ 1                           absorption


In this derivation, we’ve labeled each step with the equivalence we used. 
Most of the time we would not be this verbose. 


2.2.4 Normal forms 


A compound proposition is in conjunctive normal form (CNF for short) 
if it is obtained by ANDing together ORs of one or more variables or their 
negations (an OR of one variable is just the variable itself). So for example 
P, (P ∨ Q) ∧ R, (P ∨ Q) ∧ (Q ∨ R) ∧ (¬P), and (P ∨ Q) ∧ (P ∨ ¬R) ∧
(¬P ∨ Q ∨ S ∨ T ∨ ¬U) are in CNF, but (P ∨ Q) ∧ (P ∨ ¬R) ∧ (¬P ∧ Q),
(P ∨ Q) ∧ (P → R) ∧ (¬P ∨ Q), and (P ∨ (Q ∧ R)) ∧ (P ∨ ¬R) ∧ (¬P ∨ Q) are
not. Using the equivalence P → Q ≡ ¬P ∨ Q, De Morgan’s laws, and the
distributive law, it is possible to rewrite any compound proposition in CNF.

This doesn’t necessarily produce the simplest CNF. A famous Zen koan
involves a student going for instruction to a swordmaster who also happens 
to be a Zen monk. The master tells the student “If you draw your sword, 
I will cut off your head. If you do not draw your sword, I will cut off your 
head.” How should the student interpret this alarming statement? 

Writing P for the proposition that the student draws his sword and Q 
for the proposition that the master cuts off his head, we can immediately 
convert this to CNF by expanding the implications: 


(P → Q) ∧ (¬P → Q) ≡ (¬P ∨ Q) ∧ (P ∨ Q)


If we then attempt to simplify this by applying, say, the distributive law, 
it makes things worse: 


(¬P ∨ Q) ∧ (P ∨ Q) ≡ (¬P ∧ P) ∨ (¬P ∧ Q) ∨ (Q ∧ P) ∨ (Q ∧ Q)
                   ≡ 0 ∨ (¬P ∧ Q) ∨ (Q ∧ P) ∨ Q
                   ≡ (¬P ∧ Q) ∨ (Q ∧ P) ∨ Q.


Now the proposition is in disjunctive normal form, which means it’s
an OR of ANDs. If we look closely at the clauses, we realize that the Q clause
by itself controls the outcome of the OR, since if either of the other clauses
is true, so is Q.⁵ So in fact a simpler CNF version of this proposition is
just Q alone, which is a not very big AND over a single not very big OR
clause. Having simplified to Q, we realize that what the master just said was
“I will cut off your head.” It’s time for the student to draw his sword!

CNF formulas are particularly useful because they support resolution 
(see §2.4.1). Using the tautology (P ∨ Q) ∧ (¬P ∨ R) → Q ∨ R, we can
construct proofs from CNF formulas by looking for occurrences of some 
simple proposition and its negation and resolving them, which generates a 
new clause we can add to the list. For example, we can compute 


  (P ∨ Q) ∧ (P ∨ ¬R) ∧ (¬P ∨ Q) ∧ (¬Q ∨ R)
⊢ (P ∨ Q) ∧ (P ∨ ¬R) ∧ (¬P ∨ Q) ∧ (¬Q ∨ R) ∧ Q
⊢ (P ∨ Q) ∧ (P ∨ ¬R) ∧ (¬P ∨ Q) ∧ (¬Q ∨ R) ∧ Q ∧ R
⊢ (P ∨ Q) ∧ (P ∨ ¬R) ∧ (¬P ∨ Q) ∧ (¬Q ∨ R) ∧ Q ∧ R ∧ P
⊢ P.


This style of proof is called a resolution proof. Because of its simplicity
it is particularly well-suited for mechanical theorem provers. Such proofs
can also encode traditional proofs based on modus ponens: the inference
P ∧ (P → Q) ⊢ Q can be rewritten as resolution by expanding → to get
P ∧ (¬P ∨ Q) ⊢ Q.

⁵This is kind of a handwavy argument. If we want to justify this claim formally, we
could write out a truth table.
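
The resolution procedure described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a real theorem prover: clauses are represented as frozensets of literal strings, a leading "~" marks negation, and the names resolve and derive are hypothetical. It keeps adding resolvents until the goal clause appears or nothing new can be generated.

```python
from itertools import combinations

def resolve(c1, c2):
    """Return every clause obtained by resolving c1 with c2 on one literal."""
    resolvents = []
    for lit in c1:
        neg = lit[1:] if lit.startswith("~") else "~" + lit
        if neg in c2:
            resolvents.append(frozenset((c1 - {lit}) | (c2 - {neg})))
    return resolvents

def derive(clauses, goal):
    """Saturate the clause set under resolution; report whether goal shows up."""
    known = set(clauses)
    while goal not in known:
        new = set()
        for c1, c2 in combinations(known, 2):
            new.update(resolve(c1, c2))
        if new <= known:          # no new clauses: goal cannot be reached
            return False
        known |= new
    return True

# (P ∨ Q) ∧ (P ∨ ¬R) ∧ (¬P ∨ Q) ∧ (¬Q ∨ R)
cnf = [frozenset({"P", "Q"}), frozenset({"P", "~R"}),
       frozenset({"~P", "Q"}), frozenset({"~Q", "R"})]

derives_p = derive(cnf, frozenset({"P"}))
derives_not_p = derive(cnf, frozenset({"~P"}))
print(derives_p, derives_not_p)  # prints: True False
```

Saturation terminates because only finitely many clauses can be built from a finite set of literals, though the number of clauses can grow quickly.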

Similarly, a compound proposition is in disjunctive normal form 
(DNF) if it consists of an OR of ANDs, e.g. (P ∧ Q) ∨ (P ∧ ¬R) ∨ (¬P ∧ Q).
Just as any compound proposition can be transformed into CNF, it can 
similarly be transformed into DNF. DNF is sometimes easier to compute from 
truth tables, since we can include an AND clause recognizing each assignment
that produces a 1, and OR them together.⁶ But this may not give us a very
concise DNF. 

Note that conjunctive and disjunctive normal forms are not unique; for 
example, P ∧ Q and (P ∨ ¬Q) ∧ (P ∨ Q) ∧ (¬P ∨ Q) are both in conjunctive
normal form and are logically equivalent to each other. So while CNF can be 
handy as a way of reducing the hairiness of a formula (by eliminating nested 
parentheses or negation of non-variables, for example), it doesn’t necessarily 
let us see immediately if two formulas are really the same. 
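
The truth-table construction of a DNF mentioned above can be sketched as follows. This is a minimal Python illustration, not part of the original notes; the function name dnf_from_truth_table and the string output format are assumptions made for the example. Each satisfying row contributes one AND clause that is true only on that row.

```python
from itertools import product

def dnf_from_truth_table(formula, names):
    """Build a DNF string with one AND clause per row where formula is true."""
    clauses = []
    for values in product([False, True], repeat=len(names)):
        if formula(*values):
            literals = [n if v else "¬" + n for n, v in zip(names, values)]
            clauses.append("(" + " ∧ ".join(literals) + ")")
    return " ∨ ".join(clauses) if clauses else "0"

# P → Q, written as ¬P ∨ Q with Python's boolean operators
dnf = dnf_from_truth_table(lambda p, q: (not p) or q, ["P", "Q"])
print(dnf)  # prints: (¬P ∧ ¬Q) ∨ (¬P ∧ Q) ∨ (P ∧ Q)
```

The result has three clauses even though ¬P ∨ Q is a much shorter DNF for the same proposition, which illustrates why this method may not give a concise answer.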


2.3 Predicate logic


Using only propositional logic, we can express a simple version of a famous 
argument: 


• Socrates is a man.
• If Socrates is a man, then Socrates is mortal.
• Therefore, Socrates is mortal.


This is an application of the inference rule called modus ponens, which
says that from p and p → q you can deduce q. The first two statements are
axioms (meaning we are given them as true without proof), and the last is 
the conclusion of the argument. 

What if we encounter Socrates’s infinitely more logical cousin Spocrates? 
We’d like to argue 


• Spocrates is a man.
• If Spocrates is a man, then Spocrates is mortal.
• Therefore, Spocrates is mortal.

Unfortunately, the second step depends on knowing that humanity implies
mortality for everybody, not just Socrates. If we are unlucky in our choice of
axioms, we may not know this. What we would like is a general way to say
that humanity implies mortality for everybody, but with just propositional
logic, we can’t write this fact down.

⁶Using De Morgan’s laws, the same truth-table construction works for CNF, where we
include for each assignment that produces a 0 an OR clause that is false for that row.


2.3.1 Variables and predicates 


The solution is to extend our language to allow formulas that involve variables. 
So we might let x, y, z, etc. stand for any element of our universe of
discourse or domain—essentially whatever things we happen to be talking 
about at the moment. We can now write statements like: 


• “x is human.”
• “x is the parent of y.”
• “x + 2 = x².”


These are not propositions because they have variables in them. Instead,
they are predicates: statements whose truth-value depends on what concrete
object takes the place of the variable. Predicates are often abbreviated by
single capital letters followed by a list of arguments, the variables that
appear in the predicate, e.g.:


• H(x) = “x is human.”
• P(x, y) = “x is the parent of y.”
• Q(x) = “x + 2 = x².”


We can also fill in specific values for the variables, e.g. H(Spocrates) = 
“Spocrates is human.” If we fill in specific values for all the variables, we have 
a proposition again, and can talk about that proposition being true (e.g. 
Q(2) and Q(−1) are true) or false (Q(0) is false).
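
Programmers can think of a predicate as a function that returns a Boolean. The following Python sketch is illustrative only: the set HUMANS is a made-up stand-in for whatever facts we have about the universe, and H and Q mirror the predicates named above.

```python
HUMANS = {"Socrates", "Spocrates"}   # hypothetical facts about the universe

def H(x):
    """H(x) = "x is human"."""
    return x in HUMANS

def Q(x):
    """Q(x) = "x + 2 = x squared"."""
    return x + 2 == x ** 2

# Filling in specific values turns each predicate into a proposition.
results = {x: Q(x) for x in (2, -1, 0)}
print(results)          # prints: {2: True, -1: True, 0: False}
print(H("Spocrates"))   # prints: True
```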

In first-order logic, which is what we will be using in this course, 
variables always refer to things and never to predicates: any predicate 
symbol is effectively a constant. There are higher-order logics that allow 
variables to refer to predicates, but most mathematics accomplishes the same 
thing by representing predicates with sets (see Chapter 3).


2.3.2 Quantifiers


What we really want is to be able to say when H or P or Q is true for many 
different values of their arguments. This means we have to be able to talk 
about the truth or falsehood of statements that include variables. To do this, 
we bind the variables using quantifiers, which state whether the claim we 
are making applies to all values of the variable (universal quantification), 
or whether it may only apply to some (existential quantification). 


2.3.2.1 Universal quantifier 


The universal quantifier ∀ (pronounced “for all”) says that a statement
must be true for all values of a variable within some universe of allowed
values (which is often implicit). For example, “all humans are mortal” could
be written ∀x : Human(x) → Mortal(x) and “if x is positive then x + 1 is
positive” could be written ∀x : x > 0 → x + 1 > 0.

If you want to make the universe explicit, use set membership notation.⁷
An example would be ∀x ∈ ℤ : x > 0 → x + 1 > 0. This is logically
equivalent to writing ∀x : x ∈ ℤ → (x > 0 → x + 1 > 0) or to writing
∀x : (x ∈ ℤ ∧ x > 0) → x + 1 > 0, but the short form makes it more clear
that the intent of x ∈ ℤ is to restrict the range of x.⁸

The statement ∀x : P(x) is equivalent to a very large AND; for example,
∀x ∈ ℕ : P(x) could be rewritten (if you had an infinite amount of paper)
as P(0) ∧ P(1) ∧ P(2) ∧ P(3) ∧ .... Normal first-order logic doesn’t allow
infinite expressions like this, but it may help in visualizing what ∀x : P(x)
actually means. Another way of thinking about it is to imagine that x is
supplied by some adversary and you are responsible for showing that P(x) is
true; in this sense, the universal quantifier chooses the worst case value of x.


2.3.2.2 Existential quantifier 


The existential quantifier ∃ (pronounced “there exists”) says that a
statement must be true for at least one value of the variable. So “some human is
mortal” becomes ∃x : Human(x) ∧ Mortal(x). Note that we use AND rather
than implication here; the statement ∃x : Human(x) → Mortal(x) makes the
much weaker claim that “there is some thing x, such that if x is human, then
x is mortal,” which is true in any universe that contains an immortal purple
penguin: since it isn’t human, Human(penguin) → Mortal(penguin) is true.


⁷See Chapter 3.
⁸Programmers will recognize this as a form of syntactic sugar.

As with ∀, ∃ can be limited to an explicit universe with set membership
notation, e.g., ∃x ∈ ℤ : x = x². This is equivalent to writing
∃x : x ∈ ℤ ∧ x = x².

The formula ∃x : P(x) is equivalent to a very large OR, so that ∃x ∈ ℕ :
P(x) could be rewritten as P(0) ∨ P(1) ∨ P(2) ∨ P(3) ∨ .... Again, you can’t
generally write an expression like this if there are infinitely many terms, but
it gets the idea across.


2.3.2.3 Negation and quantifiers 


The following equivalences hold: 


¬∀x : P(x) ≡ ∃x : ¬P(x).
¬∃x : P(x) ≡ ∀x : ¬P(x).


These are essentially the quantifier version of De Morgan’s laws: the first 
says that if you want to show that not all humans are mortal, it’s equivalent 
to finding some human that is not mortal. The second says that to show 
that no human is mortal, you have to show that all humans are not mortal. 
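
Over a finite universe, the “very large AND” and “very large OR” readings of the quantifiers correspond to Python’s built-in all and any, and the two negation laws can be checked directly. The sketch below is illustrative: the helpers forall and exists, the finite universe, and the predicate P are all assumptions chosen for the example.

```python
def forall(universe, pred):
    """∀x ∈ universe : pred(x), as one big AND."""
    return all(pred(x) for x in universe)

def exists(universe, pred):
    """∃x ∈ universe : pred(x), as one big OR."""
    return any(pred(x) for x in universe)

universe = range(-5, 6)            # hypothetical finite stand-in for ℤ
P = lambda x: x * x > x            # hypothetical predicate; fails at 0 and 1

# ¬∀x : P(x)  ≡  ∃x : ¬P(x)
not_forall = not forall(universe, P)
exists_not = exists(universe, lambda x: not P(x))

# ¬∃x : P(x)  ≡  ∀x : ¬P(x)
not_exists = not exists(universe, P)
forall_not = forall(universe, lambda x: not P(x))

print(not_forall, exists_not)   # prints: True True
print(not_exists, forall_not)   # prints: False False
```

A finite check like this only illustrates the laws; it does not prove them for infinite universes.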


2.3.2.4 Restricting the scope of a quantifier 


Sometimes we want to limit the universe over which we quantify to some 
restricted set, e.g., all positive integers or all baseball teams. We’ve previously 
seen how to do this using set-membership notation, but can also do this for 
more general predicates either explicitly using implication: 


∀x : x > 0 → x − 1 ≥ 0


or in abbreviated form by including the restriction in the quantifier
expression itself:

∀x > 0 : x − 1 ≥ 0.


Similarly


∃x : x > 0 ∧ x² = 81


can be written as


∃x > 0 : x² = 81.


Note that constraints on ∃ get expressed using AND rather than implication.

The use of set membership notation to restrict a quantifier is a special 
case of this. Suppose we want to say that 79 is not a perfect square, by which
we mean that there is no integer whose square is 79. If we are otherwise
talking about real numbers (two of which happen to be square roots of 79),
we can exclude the numbers we don’t want by writing


¬∃x ∈ ℤ : x² = 79

which is interpreted as


¬∃x : (x ∈ ℤ ∧ x² = 79)


or, equivalently

∀x : x ∈ ℤ → x² ≠ 79.


Here ℤ = {..., −2, −1, 0, 1, 2, ...} is the standard set of integers.
For more uses of ∈, see Chapter 3.


2.3.2.5 Nested quantifiers 


It is possible to nest quantifiers, meaning that the statement bound by a 
quantifier itself contains quantifiers. For example, the statement “there is no 
largest prime number” could be written as 


¬∃x : (Prime(x) ∧ ∀y : y > x → ¬Prime(y))


i.e., “there does not exist an x that is prime and any y greater than x is
not prime.” Or in a shorter (though not strictly equivalent) form:


∀x ∃y : y > x ∧ Prime(y)


which we can read as “for any x there is a bigger y that is prime.”

To read a statement like this, treat it as a game between the ∀ player
and the ∃ player. Because the ∀ comes first in this statement, the for-all
player gets to pick any x it likes. The exists player then picks a y to make
the rest of the statement true. The statement as a whole is true if the ∃
player always wins the game. So if you are trying to make a statement true,
you should think of the universal quantifier as the enemy (the adversary
in algorithm analysis) who says “nya-nya: try to make this work, bucko!”,
while the existential quantifier is the friend who supplies the one working
response.

As in many two-player games, it makes a difference who goes first. If we
write likes(x, y) for the predicate that x likes y, the statements


∀x∃y : likes(x, y)

and


∃y∀x : likes(x, y)


mean very different things. The first says that for every person, there
is somebody that that person likes: we live in a world with no complete
misanthropes. The second says that there is some single person who is so
immensely popular that everybody in the world likes them. The nesting
of the quantifiers is what makes the difference: in ∀x∃y : likes(x, y), we are
saying that no matter who we pick for x, ∃y : likes(x, y) is a true statement;
while in ∃y∀x : likes(x, y), we are saying that there is some y that makes
∀x : likes(x, y) true.

Naturally, such games can go on for more than two steps, or allow the 
same player more than one move in a row. For example 


∀x∀y∃z : x² + y² = z²


is a kind of two-person challenge version of the Pythagorean theorem
where the universal player gets to pick x and y and the existential player has
to respond with a winning z. (Whether the statement itself is true or false
depends on the range of the quantifiers; it’s false, for example, if x, y, and z
are all natural numbers or rationals but true if they are all real or complex.
Note that the universal player only needs to find one bad (x, y) pair to make
it false.)

One thing to note about nested quantifiers is that we can switch the 
order of two universal quantifiers or two existential quantifiers, but we can’t 
swap a universal quantifier for an existential quantifier or vice versa. So 
for example ∀x∀y : (x = y → x + 1 = y + 1) is logically equivalent to
∀y∀x : (x = y → y + 1 = x + 1), but ∀x∃y : y < x is not logically equivalent
to ∃y∀x : y < x. This is obvious if you think about it in terms of playing
games: if I get to choose two things in a row, it doesn’t really matter which 
order I choose them in, but if I choose something and then you respond it 
might make a big difference if we make you go first instead. 
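
The effect of quantifier order is easy to see on a small finite example. In the Python sketch below, the people and the likes relation are invented for illustration; nested all and any calls follow the nesting of the quantifiers.

```python
people = ["ann", "bob", "cy"]
# Hypothetical data: each person likes exactly one other person, in a cycle.
likes = {("ann", "bob"), ("bob", "cy"), ("cy", "ann")}

# ∀x ∃y : likes(x, y): everybody likes somebody.
everyone_likes_someone = all(
    any((x, y) in likes for y in people) for x in people)

# ∃y ∀x : likes(x, y): somebody is liked by everybody.
someone_liked_by_all = any(
    all((x, y) in likes for x in people) for y in people)

print(everyone_likes_someone, someone_liked_by_all)  # prints: True False
```

In the first statement the choice of y may depend on x; in the second a single y has to work for every x, which is a much stronger demand.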

One measure of the complexity of a mathematical statement is how many 
layers of quantifiers it has, where a layer is a sequence of all-universal or 
all-existential quantifiers. Here’s a standard mathematical definition that 
involves three layers of quantifiers, which is about the limit for most humans: 


[lim_{x→∞} f(x) = y] ≡ [∀ε > 0 : ∃N : ∀x > N : |f(x) − y| < ε].


Now that we know how to read nested quantifiers, it’s easy to see what 
the right-hand side means:

1. The adversary picks ε, which must be greater than 0.
2. We pick N.
3. The adversary picks x, which must be greater than N.
4. We win if f(x) is within ε of y.

So, for example, a proof of

lim_{x→∞} 1/x = 0

would follow exactly this game plan:

1. Choose some ε > 0.

2. Let N > 1/ε. (Note that we can make our choice depend on previous
   choices.)

3. Choose any x > N.

4. Then x > N > 1/ε > 0, so 1/x < 1/N < ε, and hence |1/x − 0| < ε. QED!
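
The game plan above can be played out numerically. The following Python sketch is only a sanity check, not a proof: the adversary’s moves are simulated with random samples from arbitrary ranges, and the function names our_move and we_win are hypothetical. Our strategy is the one from the proof, choosing N strictly larger than 1/ε.

```python
import random

def our_move(eps):
    """Our reply N to the adversary's eps; any N > 1/eps works."""
    return 1.0 / eps + 1.0

def we_win(eps, x):
    """We win if f(x) = 1/x is within eps of the claimed limit 0."""
    return abs(1.0 / x - 0.0) < eps

random.seed(0)
all_won = True
for _ in range(1000):
    eps = random.uniform(1e-6, 10.0)       # adversary picks eps > 0
    N = our_move(eps)                      # we pick N, depending on eps
    x = N + random.uniform(1e-3, 1e3)      # adversary picks some x > N
    all_won = all_won and we_win(eps, x)

print(all_won)  # prints: True
```

Note that our_move takes eps as an argument: this is the code-level version of letting our choice depend on the adversary’s earlier choice.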


2.3.2.6 Examples 


Here we give some more examples of translating English into statements in 
predicate logic. 


All crows are black. ∀x : Crow(x) → Black(x)


The formula is logically equivalent to either of


¬∃x : Crow(x) ∧ ¬Black(x)

or

∀x : ¬Black(x) → ¬Crow(x).


The latter is the core of a classic “paradox of induction” in philosophy: 
if seeing a black crow makes me think it’s more likely that all crows are 
black, shouldn’t seeing a logically equivalent non-black non-crow (e.g., a 
banana yellow AMC Gremlin) also make me think all non-black objects are 
non-crows, i.e., that all crows are black? The paradox suggests that logical 
equivalence works best for true/false and not so well for probabilities.


Some cows are brown. ∃x : Cow(x) ∧ Brown(x)


No cows are blue. ¬∃x : Cow(x) ∧ Blue(x)


Some other equivalent versions:


∀x : ¬(Cow(x) ∧ Blue(x))
∀x : (¬Cow(x) ∨ ¬Blue(x))
∀x : Cow(x) → ¬Blue(x)
∀x : Blue(x) → ¬Cow(x).


All that glitters is not gold. ¬∀x : Glitters(x) → Gold(x)


Or ∃x : Glitters(x) ∧ ¬Gold(x). Note that the English syntax is a bit
ambiguous: a literal translation might look like ∀x : Glitters(x) → ¬Gold(x),
which is not logically equivalent. This is an example of how predicate logic
is often more precise than natural language.


No shirt, no service. ∀x : ¬Shirt(x) → ¬Served(x)


Every event has a cause. ∀x∃y : Causes(y, x)


And a more complicated statement: Every even number greater than 2
can be expressed as the sum of two primes.


∀x : (Even(x) ∧ x > 2) → (∃p∃q : Prime(p) ∧ Prime(q) ∧ (x = p + q))


The last one is Goldbach’s conjecture. The truth value of this state- 
ment is currently unknown. 
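
Although nobody knows whether the statement holds for all x, any finite range can be checked by brute force. The Python sketch below does this for even numbers up to 1000; the bound and the helper names is_prime and goldbach_witness are arbitrary choices for illustration, and passing the check says nothing about larger numbers.

```python
def is_prime(n):
    """Trial division; adequate for the small numbers used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def goldbach_witness(x):
    """Return primes (p, q) with p + q == x, or None if no such pair exists."""
    for p in range(2, x // 2 + 1):
        if is_prime(p) and is_prime(x - p):
            return (p, x - p)
    return None

# ∀x : (Even(x) ∧ x > 2) → ∃p ∃q : ..., restricted to x ≤ 1000.
holds_up_to_1000 = all(
    goldbach_witness(x) is not None for x in range(4, 1001, 2))

print(goldbach_witness(10), holds_up_to_1000)  # prints: (3, 7) True
```

The universal quantifier becomes the all over the range, and the two existential quantifiers become the search for a witness pair.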


2.3.3 Functions 


A function symbol looks like a predicate but instead of computing a 
truth value it returns an object. Function symbols may take zero or more 
arguments. The special case of a function symbol with zero arguments is 
called a constant. 

For example, in the expression 2 + 2 = 5, we’ve got three constants 2, 2, 
and 5, a two-argument function +, and a predicate =, which has a special 
role in predicate logic that we’ll discuss in more detail below. 

The nice thing about function symbols is that they let us populate our 
universe without having to include a lot of axioms about various things
existing. The convention is that anything we can name exists. An example
is the construction of the natural numbers 0, 1, 2, ... used with the Peano
axioms: these are represented using the constant 0 and the successor
function S, so that we can count 0, S0, SS0, SSS0, and so on.

Note however that there is no guarantee that two objects constructed in
different ways are actually distinct (2 + 2 = 4 after all). To express whether
objects are the same as each other or not requires a dedicated equality 
predicate, discussed below. 


2.3.4 Equality 


The equality predicate =, written x = y, is typically included as a standard
part of predicate logic. The interpretation of x = y is that x and y are
the same element of the domain. Equality satisfies the reflexivity axiom
∀x : x = x and the substitution axiom schema:


∀x∀y : (x = y → (Px ↔ Py))


where P is any predicate. This immediately gives a substitution rule
that says x = y, P(x) ⊢ P(y). It’s likely that almost every proof you ever
wrote down in high school algebra consisted only of many applications of the
substitution rule.

Example: We’ll prove ∀x∀y : (x = y → y = x) from the above axioms
(this property is known as symmetry). Apply substitution to the predicate
Pz ≡ z = x to get ∀x∀y : (x = y → (x = x ↔ y = x)). Use reflexivity
to rewrite this as ∀x∀y : (x = y → (1 ↔ y = x)), which simplifies to
∀x∀y : (x = y → y = x).

Exercise: Prove ∀x∀y∀z : (x = y ∧ y = z → x = z). (This property is
known as transitivity.)


2.3.4.1 Uniqueness 


The abbreviation ∃!x P(x) says “there exists a unique x such that P(x).”
This is short for


∃x (P(x) ∧ (∀y : P(y) → x = y)),


which we can read as “there is an x for which P(x) is true, and any y for
which P(y) is true is equal to x.”

An example is ∃!x : x + 1 = 12. To prove this we’d have to show not only
that there is some x for which x + 1 = 12 (11 comes to mind), but that if
we have any two values x and y such that x + 1 = 12 and y + 1 = 12, then
x = y (this is not hard to do, assuming we have at our disposal the usual
axioms of arithmetic). So the exclamation point encodes quite a bit of extra
work, which is why we often hope that ∃x : x + 1 = 12 is good enough and
pull out ∃! only if we have to.

There are several equivalent ways to expand ∃!x P(x). Applying
contraposition to P(y) → x = y gives


∃!x P(x) ≡ ∃x (P(x) ∧ (∀y : x ≠ y → ¬P(y))),


which says that any y that is not x doesn’t satisfy P. We can also play some
games with De Morgan’s laws to turn this into

∃!x P(x) ≡ ∃x (P(x) ∧ (¬∃y : x ≠ y ∧ P(y))).


This says that there is an x with P(x), but there is no y ≠ x with P(y).
All of these are just different ways of saying that x is the only object that 
satisfies P. 
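
On a finite universe the expansion of ∃! can be evaluated literally. The Python sketch below is an illustration under assumptions: the universe is a small range of integers standing in for the real domain, and the name exists_unique is hypothetical. It transcribes ∃x (P(x) ∧ ∀y : P(y) → x = y) directly.

```python
def exists_unique(universe, pred):
    """∃!x P(x): some x satisfies pred, and any y satisfying pred equals x."""
    universe = list(universe)
    return any(
        pred(x) and all((not pred(y)) or x == y for y in universe)
        for x in universe)

universe = range(-20, 21)     # hypothetical finite stand-in for the integers

one_solution = exists_unique(universe, lambda x: x + 1 == 12)
two_solutions = exists_unique(universe, lambda x: x * x == 81)
no_solution = exists_unique(universe, lambda x: x * x == 79)

print(one_solution, two_solutions, no_solution)  # prints: True False False
```

The middle case fails because both 9 and −9 satisfy the predicate, so plain ∃ holds but ∃! does not.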


2.3.5 Models 


In propositional logic, we can build truth tables that describe all possible 
settings of the truth-values of the literals. In predicate logic, the analogous 
concept to an assignment of truth-values is a structure. A structure consists 
of a set of objects or elements (built using set theory, as described in 
Chapter 3), together with a description of which elements fill in for the 
constant symbols, which predicates hold for which elements, and what 
the value of each function symbol is when applied to each possible list of 
arguments (note that this depends on knowing what constant, predicate, 
and function symbols are available—this information is called the signature 
of the structure). A structure is a model of a particular theory (set of 
statements), if each statement in the theory is true in the model. 

In general we can’t hope to find all possible models of a given theory. 
But models are useful for two purposes: if we can find some model of a 
particular theory, then the existence of this model demonstrates that the 
theory is consistent; and if we can find a model of the theory in which some 
additional statement S doesn’t hold, then we can demonstrate that there is
no way to prove S from the theory (i.e. it is not the case that T ⊢ S, where
T is the list of axioms that define the theory).


2.3.5.1 Examples 


• Consider the axiom ¬∃x. This axiom has exactly one model (it’s
  empty).

• Now consider the axiom ∃!x, which we can expand out to ∃x∀y : y = x.
  This axiom also has exactly one model (with one element).

• We can enforce exactly k elements with one rather long axiom, e.g. for
  k = 3 do ∃x₁∃x₂∃x₃∀y : (y = x₁ ∨ y = x₂ ∨ y = x₃) ∧ x₁ ≠ x₂ ∧ x₂ ≠ x₃ ∧ x₃ ≠ x₁.
  In the absence of any special symbols, a structure of 3
  undifferentiated elements is the unique model of this axiom.

• Suppose we add a predicate P and consider the axiom ∃xPx. Now
  we have many models: take any nonempty model you like, and let
  P be true of at least one of its elements. If we take a model with
  two elements a and b, with Pa and ¬Pb, we see that ∃xPx is not
  enough to prove ∀xPx, since ∃xPx is true in the model but ∀xPx isn’t.
  Conversely, an empty model satisfies ∀xPx ≡ ¬∃x¬Px but not ∃xPx.

• Now let’s bring in a function symbol S and constant symbol 0. Consider
  a stripped-down version of the Peano axioms that consists of just the
  axiom ∀x∀y : Sx = Sy → x = y. Both the natural numbers ℕ and
  the integers ℤ are a model for this axiom, as is the set ℤₘ of integers
  mod m for any m (see §8.3). In each case each element has at most one
  predecessor, which is what the axiom demands. If we throw in the first
  Peano axiom ∀x : Sx ≠ 0, we eliminate ℤ and ℤₘ because in each of
  these models 0 is a successor of some element. But we don’t eliminate
  a model that consists of two copies of ℕ sitting next to each other (only
  one of which contains the “official” 0), or even a model that consists of
  one copy of ℕ (that includes the official 0 with no predecessor) plus
  any number of copies of ℕ, ℤ, and ℤₘ.


• A practical example: The family tree of the kings of France is a
  model of the theory containing the two axioms ∀x∀y∀z : Parent(x, y) ∧
  Parent(y, z) → GrandParent(x, z) and ∀x∀y : Parent(x, y) → ¬Parent(y, x).
  But this set of axioms could use some work, since it still allows for
  the possibility that there are some x and y for which Parent(x, y) and
  GrandParent(y, x) are both true.
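
For finite structures, checking whether a structure is a model of an axiom is a direct computation. The Python sketch below is illustrative only: the modulus m = 5, the two-element structure, and all function names are assumptions made for the example. It tests the two successor axioms on ℤₘ and replays the two-element model with Pa and ¬Pb.

```python
def injective_successor(elements, S):
    """∀x ∀y : Sx = Sy → x = y."""
    return all(S(x) != S(y) or x == y for x in elements for y in elements)

def zero_not_successor(elements, S, zero):
    """∀x : Sx ≠ 0."""
    return all(S(x) != zero for x in elements)

m = 5                                   # hypothetical modulus
Zm = range(m)
S_mod = lambda x: (x + 1) % m           # successor in the integers mod m

zm_injective = injective_successor(Zm, S_mod)
zm_zero_ok = zero_not_successor(Zm, S_mod, 0)

# Two-element structure {a, b} in which P holds of a but not of b.
elements = ["a", "b"]
P = lambda x: x == "a"
exists_P = any(P(x) for x in elements)
forall_P = all(P(x) for x in elements)

print(zm_injective, zm_zero_ok)   # prints: True False
print(exists_P, forall_P)         # prints: True False
```

The first line of output matches the text: ℤₘ satisfies the injectivity axiom but is eliminated by ∀x : Sx ≠ 0, because S(m − 1) = 0.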


2.4 Proofs 


A proof is a way to derive statements from other statements. It starts with 
axioms (statements that are assumed in the current context always to be 
true), theorems or lemmas (statements that were proved already; the 
difference between a theorem and a lemma is whether it is intended as a final
result or an intermediate tool), and premises P (assumptions we are making
for the purpose of seeing what consequences they have), and uses inference 
rules to derive Q. The axioms, theorems, and premises are in a sense the 
starting position of a game whose rules are given by the inference rules. The 
goal of the game is to apply the inference rules until Q pops out. We refer to 
anything that isn’t proved in the proof itself (i.e., an axiom, theorem, lemma, 
or premise) as a hypothesis; the result Q is the conclusion. 


When a proof exists of Q from some premises P;, P2,..., we say that Q 
is deducible or provable from P,, P2,..., which is written as 
Pi, Po,...F Q. 


If we can prove Q directly from our inference rules without making any 
assumptions, we may write 


$$\vdash Q$$


The turnstile symbol $\vdash$ has the specific meaning that we can derive
the conclusion $Q$ by applying inference rules to the premises. This is not
quite the same thing as saying $P \rightarrow Q$. If our inference rules are particularly
weak, it may be that $P \rightarrow Q$ is true but we can’t prove $Q$ starting with
$P$. Conversely, if our inference rules are too strong (maybe they can prove
anything, even things that aren’t true) we might have $P \vdash Q$ but $P \rightarrow Q$ is
false.

For propositions, most of the time we will use inference rules that are just
right, meaning that $P \vdash Q$ implies that $P \rightarrow Q$ is a tautology (soundness),
and $P \rightarrow Q$ being a tautology implies that $P \vdash Q$ (completeness). Here
the distinction between $\vdash$ and $\rightarrow$ is whether we want to talk about the
existence of a proof (the first case) or about the logical relation between two
statements (the second).

Things get a little more complicated with statements involving predicates. 
For predicate logic, there are incompleteness theorems that say that if 
our system of axioms is powerful enough (basically capable of representing 
arithmetic), then there are statements $P$ such that neither $P$ nor $\lnot P$
is provable unless the theory is inconsistent.


2.4.1 Inference Rules 


Inference rules let us construct valid arguments, which have the useful
property that if their premises are true, their conclusions are also true.
The main source of inference rules is tautologies of the form
$P_1 \land P_2 \land \dots \rightarrow Q$; given such a tautology, there is a corresponding
inference rule that allows us to assert $Q$ once we have $P_1, P_2, \dots$. Given an
inference rule of this form and a goal $Q$, we can then look for ways to show
$P_1, P_2, \dots$ all hold, either because each $P_i$ is an axiom/theorem/premise or
because we can prove it from other axioms, theorems, or premises.

The most important inference rule is modus ponens, based on the
tautology $(p \land (p \rightarrow q)) \rightarrow q$; this lets us, for example, write the following
famous argument:⁹


1. If it doesn’t fit, you must acquit. [Axiom] 
2. It doesn’t fit. [Premise] 
3. You must acquit. [Modus ponens applied to 1+2] 


There are many named inference rules in classical propositional logic. 
We'll list some of them below. You don’t need to remember the names 
of anything except modus ponens, and most of the rules are pretty much 
straightforward applications of modus ponens plus some convenient tautology 
that can be proved by truth tables or stock logical equivalences. (For example, 
the “addition” rule below is just the result of applying modus ponens to $p$
and the tautology $p \rightarrow (p \lor q)$.)

Inference rules are often written by putting the premises above a hor- 
izontal line and the conclusion below. In text, the horizontal line is often 
replaced by the symbol $\vdash$, which means exactly the same thing. Premises
are listed on the left-hand side separated by commas, and the conclusion is 
placed on the right. We can then write 


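$p \vdash p \lor q$.                                          Addition
$p \land q \vdash p$.                                         Simplification
$p, q \vdash p \land q$.                                      Conjunction
$p, p \rightarrow q \vdash q$.                                Modus ponens
$\lnot q, p \rightarrow q \vdash \lnot p$.                    Modus tollens
$p \rightarrow q, q \rightarrow r \vdash p \rightarrow r$.    Hypothetical syllogism
$p \lor q, \lnot p \vdash q$.                                 Disjunctive syllogism
$p \lor q, \lnot p \lor r \vdash q \lor r$.                   Resolution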
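
As a sanity check on the list above, each rule can be turned into a formula of the form "premises imply conclusion" and tested by brute-force truth table. The following is only a sketch: the helper names `implies`, `is_tautology`, and `RULES` are invented for this illustration, and Python booleans stand in for propositions.

```python
from itertools import product

def implies(a, b):
    """Material implication a -> b."""
    return (not a) or b

def is_tautology(formula, num_vars):
    """True if formula evaluates to True under every truth assignment."""
    return all(formula(*row) for row in product([False, True], repeat=num_vars))

# Each rule "P1, P2 |- Q" is encoded as the formula (P1 and P2) -> Q,
# paired with the number of propositional variables it uses.
RULES = {
    "addition": (lambda p, q: implies(p, p or q), 2),
    "simplification": (lambda p, q: implies(p and q, p), 2),
    "conjunction": (lambda p, q: implies(p and q, p and q), 2),
    "modus ponens": (lambda p, q: implies(p and implies(p, q), q), 2),
    "modus tollens": (lambda p, q: implies((not q) and implies(p, q), not p), 2),
    "hypothetical syllogism": (
        lambda p, q, r: implies(implies(p, q) and implies(q, r), implies(p, r)), 3),
    "disjunctive syllogism": (lambda p, q: implies((p or q) and (not p), q), 2),
    "resolution": (lambda p, q, r: implies((p or q) and ((not p) or r), q or r), 3),
}

results = {name: is_tautology(f, n) for name, (f, n) in RULES.items()}
for name, ok in results.items():
    print(name, "is a tautology" if ok else "is NOT a tautology")

# A non-rule for contrast: "affirming the consequent" (q, p -> q |- p).
affirming_consequent = is_tautology(lambda p, q: implies(q and implies(p, q), p), 2)
```

The last line shows why not every plausible-looking pattern is an inference rule: the formula fails when p is false and q is true.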


Of these rules, addition, simplification, and conjunction are mostly used
to pack and unpack pieces of arguments. Modus ponens “the method of
affirming” (and its reversed cousin modus tollens “the method of denying”)
let us apply implications. You don’t need to remember modus tollens if you
can remember the contraposition rule $(p \rightarrow q) \equiv (\lnot q \rightarrow \lnot p)$. Hypothetical
syllogism just says that implication is transitive; it lets you paste together
implications if the conclusion of one matches the premise of the other.
Disjunctive syllogism is again a disguised version of modus ponens (via the
logical equivalence $(p \lor q) \equiv (\lnot p \rightarrow q)$); you don’t need to remember it if you
can remember this equivalence. Resolution is almost never used by humans
but is very popular with computer theorem provers.

⁹ Maybe not as famous as it once was.

An argument is valid if the conclusion is true whenever the hypotheses 
are true. Any proof constructed using the inference rules is valid. It does not 
necessarily follow that the conclusion is true; it could be that one or more of 
the hypotheses is false: 


1. If you give a mouse a cookie, he’s going to ask for a glass of milk. [Axiom]
2. If he asks for a glass of milk, he will want a straw. [Axiom]
3. You gave a mouse a cookie. [Premise]
4. He asks for a glass of milk. [Modus ponens applied to 1 and 3.]
5. He will want a straw. [Modus ponens applied to 2 and 4.]


Will the mouse want a straw? No: Mice can’t ask for glasses of milk, so 
Axiom 1 is false. 


2.4.2 Proofs, implication, and natural deduction 


Recall that $P \vdash Q$ means there is a proof of $Q$ by applying inference rules
to $P$, while $P \rightarrow Q$ says that $Q$ holds whenever $P$ does. These are not the
same thing: provability ($\vdash$) is outside the theory (it’s a statement about
whether a proof exists or not) while implication ($\rightarrow$) is inside (it’s a logical
connective for making compound propositions). But most of the time they
mean almost the same thing.

For example, suppose that $P \rightarrow Q$ is provable without any assumptions:

$$\vdash P \rightarrow Q.$$

Since we can always ignore extra premises, we get

$$P \vdash P \rightarrow Q$$




and thus

$$P \vdash P, P \rightarrow Q,$$

which gives

$$P \vdash Q$$

by applying modus ponens to the right-hand side.

So we can go from $\vdash P \rightarrow Q$ to $P \vdash Q$.

This means that provability is in a sense weaker than implication: it
holds (assuming modus ponens) whenever implication does. But we usually
don’t use this fact much, since $P \rightarrow Q$ is a much more useful statement than
$P \vdash Q$. Can we go the other way?


2.4.2.1 The Deduction Theorem 


Yes, using the Deduction Theorem. 

Often we want to package the result of a proof as a theorem (a proven 
statement that is an end in itself) or lemma (a proven statement that is 
intended mostly to be used in other proofs). Typically a proof shows that,
given some base assumptions $\Gamma$, if certain premises $P_1, P_2, \dots, P_n$ hold, then
some conclusion $Q$ holds (with various axioms or previously-established
theorems assumed to be true from context). To use this result later, it is
useful to be able to package it as an implication $P_1 \land P_2 \land \dots \land P_n \rightarrow Q$. In
other words, we want to go from

$$\Gamma, P_1, P_2, \dots, P_n \vdash Q$$

to

$$\Gamma \vdash (P_1 \land P_2 \land \dots \land P_n) \rightarrow Q.$$


The statement that we can do this, for a given collection of inference 
rules, is the Deduction Theorem: 


Theorem 2.4.1 (Deduction Theorem). If there is a proof of $Q$ from premises
$\Gamma, P_1, P_2, \dots, P_n$, then there is a proof of $P_1 \land P_2 \land \dots \land P_n \rightarrow Q$ from $\Gamma$
alone.


The actual proof of the theorem depends on the particular set of inference 
rules we start with, but the basic idea is that there exists a mechanical 
procedure for extracting a proof of the implication from the proof of Q 
assuming $P_1$, etc.

Caveat: In predicate logic, the deduction theorem only applies if none
of the premises contain any free variables (which are variables that aren’t 
bound by a universal or existential quantifier). Usually you won’t run into 
this, but there are some bad cases that arise without this restriction. 
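
For propositional logic, where the inference rules are sound and complete, the semantic counterpart of the Deduction Theorem can be checked by truth table. The sketch below is not the mechanical proof transformation the theorem describes; it only confirms on small examples that "Gamma and P entail Q" and "Gamma entails P implies Q" agree. The names `entails` and `packaged` are invented for this illustration.

```python
from itertools import product

def entails(premises, conclusion, num_vars):
    """Semantic entailment: conclusion holds in every row where all premises hold."""
    return all(
        conclusion(*row)
        for row in product([False, True], repeat=num_vars)
        if all(p(*row) for p in premises)
    )

def packaged(premises, conclusion):
    """Build the single formula (P1 and ... and Pn) -> Q."""
    return lambda *row: (not all(p(*row) for p in premises)) or conclusion(*row)

# Example 1: Gamma = {p -> q}, extra premise P = p, conclusion Q = q.
gamma = [lambda p, q: (not p) or q]
extra = [lambda p, q: p]
goal = lambda p, q: q
before = entails(gamma + extra, goal, 2)           # Gamma, P entail Q
after = entails(gamma, packaged(extra, goal), 2)   # Gamma entails P -> Q

# Example 2: with no base assumptions, p alone does not entail q,
# and correspondingly p -> q is not a tautology.
before_bad = entails(extra, goal, 2)
after_bad = entails([], packaged(extra, goal), 2)

print(before, after, before_bad, after_bad)
```

In both examples the two sides agree, which is what the theorem (together with soundness and completeness) predicts.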


2.4.2.2 Natural deduction 


In practice, we usually don’t refer to the Deduction Theorem directly, and 
instead adopt a new inference rule: 


$$\frac{\Gamma, P \vdash Q}{\Gamma \vdash P \rightarrow Q} \quad (\rightarrow I)$$


This says that if we can prove $Q$ using assumptions $\Gamma$ and $P$, then we
can prove $P \rightarrow Q$ using just $\Gamma$. Note that the horizontal line acts like a
higher-order version of $\vdash$; it lets us combine one or more proofs into a new,
bigger proof.

This style of inference rule, where we explicitly track what assumptions 
go into a particular result, is known as natural deduction. The natural 
deduction approach was invented by Gentzen [Gen35a, Gen35b] as a way to 
make inference rules more closely match actual mathematical proof-writing 
practice than the modus-ponens-only approach that modern logicians had
been using up to that point.¹⁰

The particular rule $(\rightarrow I)$ is called introducing implication. There is a
corresponding rule for eliminating implication that is essentially just modus
ponens:

$$\frac{\Gamma \vdash P \rightarrow Q \quad \Gamma \vdash P}{\Gamma \vdash Q} \quad (\rightarrow E)$$

If we want to be really systematic about things, we can rewrite most of our
standard inference rules as introduction and elimination rules for particular
operators. This can make them a bit easier to remember, since for each
Boolean operator there is often an “obvious” introduction and elimination
rule for it. See Table 2.4 for a list.


2.4.3 Inference rules for equality 


The equality predicate is special, in that it allows for the substitution rule

$$x = y, P(x) \vdash P(y).$$

If we don’t want to include the substitution rule as an inference rule, we
could instead represent it as an axiom schema:

$$\forall x: \forall y: ((x = y \land P(x)) \rightarrow P(y)).$$

But this is messier.

We can also assert $x = x$ directly:

$$\vdash x = x$$


$$\frac{\Gamma \vdash P \quad \Gamma \vdash Q}{\Gamma \vdash P \land Q} \quad (\land I)$$

$$\frac{\Gamma \vdash P \rightarrow Q \quad \Gamma \vdash P}{\Gamma \vdash Q} \quad (\rightarrow E_1)$$

$$\frac{\Gamma \vdash P \rightarrow Q \quad \Gamma \vdash \lnot Q}{\Gamma \vdash \lnot P} \quad (\rightarrow E_2)$$

Table 2.4: Natural deduction: introduction and elimination rules


¹⁰ See http://plato.stanford.edu/entries/proof-theory-development/ for a more
detailed history of the development of proof theory in general and [Pel99] for a discussion
of how different versions of proof theory have been adopted in textbooks.


2.4.4 Inference rules for quantified statements 


Universal generalization If $y$ is a variable that does not appear in $\Gamma$, then

$$\frac{\Gamma \vdash P(y)}{\Gamma \vdash \forall x: P(x)}$$

This says that if we can prove that some property holds for a “generic”
$y$, without using any particular properties of $y$, then in fact the property
holds for all possible $x$.


In a written proof, this will usually be signaled by starting with something
like “Let $y$ be an arbitrary [member of some universe]”. For
example: Suppose we want to show that there is no biggest natural
number, i.e. that $\forall n \in \mathbb{N}: \exists n' \in \mathbb{N}: n' > n$. Proof: Let $n$ be any
element of $\mathbb{N}$. Let $n' = n + 1$. Then $n' > n$. (Note: there is also an
instance of existential generalization here.)


Universal instantiation In the other direction, we have 

$$\forall x: Q(x) \vdash Q(c).$$


Here we go from a general statement about all possible values x to a 
statement about a particular value. Typical use: Given that all humans 
are mortal, it follows that Spocrates is mortal. 


Existential generalization This is essentially the reverse of universal in- 
stantiation: it says that, if c is some particular object, we get 


$$Q(c) \vdash \exists x: Q(x).$$


The idea is that to show that $Q(x)$ holds for at least one $x$, we can
point to $c$ as a specific example of an object for which $Q$ holds. The
corresponding style of proof is called a proof by construction or
proof by example.


For example: We are asked to prove that there exists an even prime 
number. Look at 2: it’s an even prime number. QED. 


Not all proofs of existential statements are constructive, in the sense
of identifying a single object that makes the existential statement
true. An example is a well-known non-constructive proof that there
are irrational numbers $a$ and $b$ for which $a^b$ is rational. The
non-constructive proof is to consider $\sqrt{2}^{\sqrt{2}}$. If this number is rational,
it’s an example of the claim; if not,
$\left(\sqrt{2}^{\sqrt{2}}\right)^{\sqrt{2}} = \sqrt{2}^{2} = 2$ works.¹¹


Non-constructive proofs are generally not as useful as constructive 
proofs, because the example used in a constructive proof may have 
additional useful properties in other contexts. 


Existential instantiation $\exists x: Q(x) \vdash Q(c)$ for some $c$, where $c$ is a
new name that hasn’t previously been used (this is similar to the
requirement for universal generalization, except now the new name is
on the right-hand side).


The idea here is that we are going to give a name to some $c$ that satisfies
$Q(c)$, and we know that we can get away with this because $\exists x: Q(x)$ says
that some such thing exists.¹²


In a proof, this is usually signaled by “let $x$ be...” or “call it $x$.” For
example: Suppose we know that there exists a prime number greater
than 7. Let $p$ be some such prime number greater than 7.
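
Over a finite universe, the four quantifier rules have direct computational analogues, which may help make them concrete. The sketch below assumes a hypothetical finite `DOMAIN` standing in for the natural numbers, so it illustrates the rules rather than proving anything about all of N; the helper names are invented for this example.

```python
import math

DOMAIN = range(50)  # hypothetical finite universe standing in for N

def is_prime(n):
    """Trial division; adequate for the small numbers used here."""
    return n > 1 and all(n % d != 0 for d in range(2, math.isqrt(n) + 1))

def is_even(n):
    return n % 2 == 0

# Existential generalization: exhibiting the witness c = 2 establishes
# "there exists an even prime" (a proof by construction).
c = 2
witness_ok = is_even(c) and is_prime(c)
exists_even_prime = any(is_even(x) and is_prime(x) for x in DOMAIN)

# Universal instantiation: a statement that holds for every x in DOMAIN
# holds in particular for the specific value 7.
all_big_primes_odd = all(not is_even(x) for x in DOMAIN if is_prime(x) and x > 2)
seven_is_odd = not is_even(7)

# Existential instantiation: give the name p to some prime greater than 7,
# without relying on which one it is.
p = next(x for x in DOMAIN if is_prime(x) and x > 7)

# Floating-point sanity check (not a proof) of the non-constructive example:
# (sqrt(2) ** sqrt(2)) ** sqrt(2) should be approximately 2.
r = math.sqrt(2)
tower = (r ** r) ** r

print(witness_ok, exists_even_prime, all_big_primes_odd, seven_is_odd)
print(is_prime(p) and p > 7, math.isclose(tower, 2.0))
```

Universal generalization has no finite shortcut in general: `all(...)` works here only because `DOMAIN` is finite, whereas a written proof handles infinitely many cases by reasoning about an arbitrary element.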


In natural-deduction terms, we can think of these rules as introduction
and elimination rules for $\forall$ and $\exists$. Table 2.5 shows what these look like.


$$\frac{\Gamma \vdash P(c)}{\Gamma \vdash \forall x: P(x)} \quad (\forall I)$$

$$\frac{\Gamma \vdash \forall x: P(x)}{\Gamma \vdash P(c)} \quad (\forall E)$$

$$\frac{\Gamma \vdash P(c)}{\Gamma \vdash \exists x: P(x)} \quad (\exists I)$$

$$\frac{\Gamma \vdash \exists x: P(x)}{\Gamma \vdash P(c)} \quad (\exists E)$$

Table 2.5: Natural deduction: introduction and elimination rules for
quantifiers. For $\forall I$ and $\exists E$, $c$ is a new symbol that does not appear in $P$ or
$\Gamma$.


¹¹ For this particular claim, there is also a constructive proof: $\sqrt{2}^{\log_2 9} = 3$ [Sch01].

¹² This is actually a fairly painful idea to formalize. One version in pure first-order logic
is the axiom

$$((\forall x: (Q(x) \rightarrow P)) \land \exists y: Q(y)) \rightarrow P.$$

Nobody but a logician would worry about this.


2.5 Proof techniques


A proof technique is a template for how to go about proving particular
classes of statements: this template guides you in the choice of inference
rules (or other proof techniques) to write the actual proof. This doesn’t
substitute entirely for creativity (there is no efficient mechanical procedure
for generating even short proofs unless P = NP), but it can give you some
hints for how to get started.

Table 2.6 gives techniques for trying to prove $A \rightarrow B$ for particular
statements A and B. The techniques are mostly classified by the structure 
of B. Before applying each technique, it may help to expand any definitions 
that appear in A or B. 

These strategies are largely drawn from [Sol05], particularly the summary 
table in the appendix, which is the source of the order and organization of 
the table and the names of most of the techniques. The table omits some 
techniques that are mentioned in Solow [Sol05]: Direct Uniqueness, Indirect 
Uniqueness, and various max/min arguments. The remaining techniques 
mostly follow directly from the inference rules from the preceding section; 
an exception is induction, which will be discussed in Chapter 5. 

For other sources, Ferland [Fer08] has an entire chapter on proof tech- 
niques of various sorts. Rosen [Ros12] describes proof strategies in §§1.5-1.7 
and Biggs [Big02] describes various proof techniques in Chapters 1, 3, and 4; 
both descriptions are a bit less systematic than the ones in Solow or Ferland, 
but also include a variety of specific techniques that are worth looking at. 

If you want to prove $A \leftrightarrow B$, the usual approach is to prove $A \rightarrow B$ and
$B \rightarrow A$ separately. Proving $A \rightarrow B$ and $\lnot A \rightarrow \lnot B$ also works (because of
contraposition).


Strategy: Direct proof
When: Try it first.
Assume: $A$
Conclude: $B$
What to do/why it works: Apply inference rules to work forward from $A$
and backward from $B$; when you meet in the middle, pretend that you were
working forward from $A$ all along.

Strategy: Contraposition
When: $B = \lnot Q$
Assume: $\lnot B$
Conclude: $\lnot A$
What to do/why it works: Apply any other technique to show $\lnot B \rightarrow \lnot A$
and then apply the contraposition rule. Sometimes called an indirect proof
although the term indirect proof is often used instead for proofs by
contradiction (see below).

Strategy: Contradiction
When: $B = \lnot Q$, or when you are stuck trying the other techniques.
Assume: $A \land \lnot B$
Conclude: False
What to do/why it works: Apply previous methods to prove both $P$ and $\lnot P$
for some $P$. Note: this can be a little dangerous, because you are assuming
something that is (probably) not true, and it can be hard to detect as you
prove further false statements whether the reason they are false is that
they follow from your false assumption, or because you made a mistake.
Direct or contraposition proofs are preferred because they don’t have this
problem.

Strategy: Construction
When: $B = \exists x P(x)$
Assume: $A$
Conclude: $P(c)$ for some specific object $c$.
What to do/why it works: Pick a likely-looking $c$ and prove that $P(c)$ holds.

Strategy: Counterexample
When: $B = \lnot \forall x P(x)$
Assume: $A$
Conclude: $\lnot P(c)$ for some specific object $c$.
What to do/why it works: Pick a likely-looking $c$ and show that $\lnot P(c)$
holds. This is identical to a proof by construction, except that we are
proving $\exists x \lnot P(x)$, which is equivalent to $\lnot \forall x P(x)$.

Strategy: Choose
When: $B = \forall x (P(x) \rightarrow Q(x))$
Assume: $A$, $P(c)$, where $c$ is chosen arbitrarily.
Conclude: $Q(c)$
What to do/why it works: Choose some $c$ and assume $A$ and $P(c)$. Prove
$Q(c)$. Note: $c$ is a placeholder here. If $P(c)$ is “$c$ is even” you can write
“Let $c$ be even” but you can’t write “Let $c = 12$”, since in the latter case
you are assuming extra facts about $c$.

Strategy: Instantiation
When: $A = \forall x P(x)$
Assume: $A$
Conclude: $B$
What to do/why it works: Pick some particular $c$ and prove that
$P(c) \rightarrow B$. Here you can get away with saying “Let $c = 12$.” (If $c = 12$
makes $B$ true).

Strategy: Elimination
When: $B = C \lor D$
Assume: $A \land \lnot C$
Conclude: $D$
What to do/why it works: The reason this works is that $A \land \lnot C \rightarrow D$
is equivalent to $\lnot(A \land \lnot C) \lor D \equiv \lnot A \lor C \lor D \equiv A \rightarrow (C \lor D)$. Of
course, it works equally well if you start with $A \land \lnot D$ and prove $C$.

Strategy: Case analysis
When: $A = C \lor D$
Assume: $C$, $D$
Conclude: $B$
What to do/why it works: Here you write two separate proofs: one that
assumes $C$ and proves $B$, and one that assumes $D$ and proves $B$. A special
case is when $D = \lnot C$. You can also consider more cases, as long as $A$
implies at least one of the cases holds.

Strategy: Induction
When: $B = \forall x \in \mathbb{N}\, P(x)$
Assume: $A$
Conclude: $P(0)$ and $\forall x \in \mathbb{N}: (P(x) \rightarrow P(x + 1))$.
What to do/why it works: If $P(0)$ holds, and $P(x)$ implies $P(x + 1)$ for
all $x$, then for any specific natural number $n$ we can consider constructing
a sequence of proofs $P(0) \rightarrow P(1) \rightarrow P(2) \rightarrow \dots \rightarrow P(n)$. (This is actually a
defining property of the natural numbers.)

Table 2.6: Proof techniques (adapted from [Sol05])
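
Several rows of Table 2.6 rest on propositional equivalences that can be confirmed by exhaustive truth table. The following sketch checks four of them; `implies` and `equivalent` are helper names invented for this illustration, and the check covers only the propositional skeleton of each technique, not the quantifier-based ones.

```python
from itertools import product

def implies(a, b):
    """Material implication a -> b."""
    return (not a) or b

def equivalent(f, g, num_vars):
    """True if formulas f and g agree under every truth assignment."""
    return all(f(*row) == g(*row) for row in product([False, True], repeat=num_vars))

# Contraposition: A -> B is equivalent to (not B) -> (not A).
contraposition = equivalent(
    lambda a, b: implies(a, b),
    lambda a, b: implies(not b, not a), 2)

# Contradiction: A -> B is equivalent to (A and not B) -> False.
contradiction = equivalent(
    lambda a, b: implies(a, b),
    lambda a, b: implies(a and not b, False), 2)

# Elimination: (A and not C) -> D is equivalent to A -> (C or D).
elimination = equivalent(
    lambda a, c, d: implies(a and not c, d),
    lambda a, c, d: implies(a, c or d), 3)

# Case analysis: (C -> B) and (D -> B) is equivalent to (C or D) -> B.
case_analysis = equivalent(
    lambda b, c, d: implies(c, b) and implies(d, b),
    lambda b, c, d: implies(c or d, b), 3)

# For contrast: A -> B is not equivalent to its converse B -> A.
converse = equivalent(
    lambda a, b: implies(a, b),
    lambda a, b: implies(b, a), 2)

print(contraposition, contradiction, elimination, case_analysis, converse)
```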


2.6 Examples of proofs 


Real proofs by actual human mathematicians are usually written in a con- 
densed style that uses ordinary language, without trying to convert everything 
into logical notation. But in principle it should be possible to translate any 
such proof into a formal proof. In this section, we give some examples of 
what a condensed proof might look like, and explain how the steps used in 
such proofs correspond to inference rules we’ve already seen. 


2.6.1 Axioms for even numbers 


Let’s define what it means for a number to be even, where we use the
Peano-axiom convention for writing numbers as $0, S0, SS0$, etc. We will use the
following axioms for our definition, where $Ex$ means that $x$ is even:


$A_1$: $\forall x: Ex \leftrightarrow (x = 0 \lor (\exists y: Ey \land x = SSy))$
$A_2$: $\forall x: 0 \ne Sx$.
$A_3$: $\forall x \forall y: Sx = Sy \rightarrow x = y$.

Here $A_1$ is the definition of $Ex$ and $A_2$ and $A_3$ are general axioms about
$S$ that we are throwing in because we will need them in some of our proofs.


2.6.2 A theorem and its proof


Now let’s prove this exciting theorem:

Theorem 2.6.1. All of the following statements are true:

1. $E0$.
2. $\lnot E(S0)$.
3. $E(SS0)$.
4. $\lnot E(SSS0)$.
5. $E(SSSS0)$.

Proof. 1. Axiom $A_1$ says that $x$ is even if it is 0.


2. Suppose $E(S0)$ holds. Then either $S0 = 0$ or $S0 = SSy$ for some
$y$ such that $Ey$ holds. The first case contradicts $A_2$; in the second
case, applying $A_3$ gives that $S0 = SSy$ implies $0 = Sy$, which again
contradicts $A_2$. So in either case we arrive at a contradiction, and our
original assumption that $E(S0)$ is true does not hold.


(This is an example of an indirect proof.)


3. From $A_1$ we have that $E(SS0)$ holds if there exists some $y$ such that
$Ey$ and $SS0 = SSy$. Let $y = 0$.


4. We have previously established $\lnot E(S0)$. We also know that $SSS0 \ne 0$,
so $E(SSS0)$ is true if and only if $SSS0 = SSy$ for some $y$ with $Ey$.
Applying $A_3$ twice gives $SSS0 = SSy$ iff $S0 = y$. But we already
showed $\lnot E(S0)$, so $\lnot E(SSS0)$.


5. Since $E(SS0)$ and $SSSS0 = SS(SS0)$, $E(SSSS0)$.

The nice thing about proving all of these facts at once is that as we prove
each one we can use that fact to prove the later ones. From a purely stylistic
point of view, we can also assume that the reader is probably starting to
catch on to some of the techniques we are using, which is why the argument
for $E(SSSS0)$ is so succinct compared to the argument for $E(SS0)$.

If we had to expand these arguments out using explicit inference rules,
they would take longer, but we could do it. Let’s try this for the proof of
$\lnot E(S0)$. We are trying to establish that $A_1, A_2, A_3 \vdash \lnot E(S0)$. Abbreviating
$A_1, A_2, A_3$ as $\Gamma$, the strategy is to show that $\Gamma \vdash E(S0) \rightarrow Q$ for some $Q$
with $\Gamma \vdash \lnot Q$; we can then apply the $\rightarrow E_2$ rule (aka modus tollens) to get
$\Gamma \vdash \lnot E(S0)$.

Formally, this looks like:


1. $\Gamma \vdash E(S0) \leftrightarrow (S0 = 0 \lor \exists y: (Ey \land S0 = SSy))$. ($\forall E$ applied to $A_1$.)

2. $\Gamma \vdash E(S0) \rightarrow (S0 = 0 \lor \exists y: (Ey \land S0 = SSy))$. (Expand $\leftrightarrow$ and use
one of the $\land$ elimination rules.)

3. $\Gamma, E(S0) \vdash S0 = 0 \lor \exists y: (Ey \land S0 = SSy)$. ($\rightarrow E$.)

4. $\Gamma, E(S0) \vdash \lnot(S0 = 0)$. (Apply $\forall E$ to $A_2$.)

5. $\Gamma, E(S0) \vdash \exists y: (Ey \land S0 = SSy)$. (Combine last two steps using $\lor E$.)

6. $\Gamma, E(S0) \vdash Ez \land S0 = SSz$. (This is $\exists E$. In the condensed proof we
didn’t rename $y$, but calling it $z$ here makes it a little more obvious
that we are fixing some particular constant.)

7. $\Gamma, E(S0) \vdash S0 = SSz$. ($\land E_2$.)

8. $\Gamma, E(S0) \vdash S0 = SSz \leftrightarrow 0 = Sz$. (Apply $\forall E$ to $A_3$.)

9. $\Gamma, E(S0) \vdash S0 = SSz \rightarrow 0 = Sz$. (Another expansion plus $\land E$.)

10. $\Gamma, E(S0) \vdash 0 = Sz$. (Apply $\rightarrow E_1$ to $S0 = SSz$ and $S0 = SSz \rightarrow 0 = Sz$.)

11. $\Gamma \vdash E(S0) \rightarrow 0 = Sz$. ($\rightarrow I$.)

12. $\Gamma \vdash \lnot(0 = Sz)$. ($\forall E$ and $A_2$.)

13. $\Gamma \vdash \lnot E(S0)$. ($\rightarrow E_2$.)

One thing to notice about the formal argument is how $E(S0)$ moves in
and out of the left-hand side of the turnstile in the middle of the proof.
This is a pretty common trick, and is what is going on whenever you read
a proof that says something like “suppose $P$ holds” or “consider the case
where $P$ holds.” Being able to just carry $P$ (in this case, $E(S0)$) around as
an assumption saves a lot of writing “if $P$” over and over again, and more
formally is what allows us to unpack $P \rightarrow Q$ and apply inference rules to $Q$.


2.6.3 A more general theorem 


So far we have only proved results about a few specific numbers. Can we say 
anything about all numbers? Let’s try to prove the following theorem: 


Theorem 2.6.2. For all $x$, if $x$ is even, $SSSSx$ is even.


Proof. Let $x$ be even. Then $SSx$ is even (Axiom $A_1$), and so $SS(SSx) =
SSSSx$ is also even.


Written out using natural-deduction inference rules (with some of the
more boring steps omitted), the proof would look like this:


1. $\Gamma, Ex \vdash (\exists y: Ey \land SSx = SSy) \rightarrow E(SSx)$. (Axiom $A_1$, $\forall E$, $\lor E$.)

2. $\Gamma, Ex \vdash Ex$.

3. $\Gamma, Ex \vdash SSx = SSx$. (Reflexivity of $=$.)

4. $\Gamma, Ex \vdash Ex \land SSx = SSx$. ($\land I$ applied to previous two steps.)

5. $\Gamma, Ex \vdash \exists y: Ey \land SSx = SSy$. (Let $y = x$.)

6. $\Gamma, Ex \vdash E(SSx)$. (Modus ponens!)

7. $\Gamma, Ex \vdash E(SSSSx)$. (Do it all again to show $E(SSx) \rightarrow E(SSSSx)$.
This is the boring part we promised to omit.)

8. $\Gamma \vdash Ex \rightarrow E(SSSSx)$. ($\rightarrow I$.)

9. $\Gamma \vdash \forall x: Ex \rightarrow E(SSSSx)$. ($\forall I$.)


If we had to write all the boring parts out, it might make sense to first
prove a lemma $\forall x: Ex \rightarrow E(SSx)$ and then just apply the lemma twice.

The instruction “let $x$ be even” is doing a lot of work in the condensed
proof: it is introducing both a new name $x$ that we will use for the Universal
Generalization rule $\forall I$, and the assumption that $x$ is even that we will use
for the Deduction Theorem rule $\rightarrow I$. Note that we can’t apply $\forall I$ until we’ve
moved the assumption $Ex$ out of the left-hand side of the turnstile, because
Universal Generalization only works if $x$ is not a name mentioned in the
assumptions.
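
Theorems 2.6.1 and 2.6.2 can be spot-checked in the standard model by unfolding axiom $A_1$ as a recursive function. The sketch below is an illustration under stated assumptions, not a proof from the axioms: it represents the numeral with n copies of S as the Python integer n, and the names `S`, `E`, and `ZERO` are chosen here to mirror the text.

```python
def S(x):
    """Successor: the numeral S...S0 with n copies of S is the int n."""
    return x + 1

def E(x):
    """Evenness obtained by unfolding axiom A1 on standard numerals."""
    if x == 0:
        return True        # A1: x = 0 makes Ex true
    if x == 1:
        return False       # S0 is not 0 (A2) and not SSy for any y (A3, then A2)
    return E(x - 2)        # x = SSy forces y = x - 2 (A3 applied twice)

ZERO = 0

# The five claims of Theorem 2.6.1, in order.
theorem_2_6_1 = [
    E(ZERO),
    not E(S(ZERO)),
    E(S(S(ZERO))),
    not E(S(S(S(ZERO)))),
    E(S(S(S(S(ZERO))))),
]

# Theorem 2.6.2 checked on an initial segment: Ex implies E(SSSSx).
theorem_2_6_2 = all(E(S(S(S(S(x))))) for x in range(200) if E(x))

print(all(theorem_2_6_1), theorem_2_6_2)
```

Checking finitely many numerals in one model says nothing about other models of the axioms, which is exactly the gap that a proof from $A_1$, $A_2$, and $A_3$ closes.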


2.6.4 Something we can’t prove 


One thing we probably know about the natural numbers is that if x is even, 
then $x + 1$ is odd, and vice versa. As a theorem this would look like


Claim 2.6.3. For all $x$, $Ex \leftrightarrow \lnot E(Sx)$.


Unfortunately our axiom system is not strong enough to prove this claim. 
Here is a model that satisfies the axioms but for which the claim fails: 


1. Include the ordinary natural numbers $0, S0, SS0$, etc. with $E0$, $\lnot E(S0)$, $E(SS0)$,
etc.


2. Include an extra unnatural number $u$ such that $u = Su$ and $Eu$ holds.


It turns out that adding $u$ doesn’t violate any of the axioms. Axiom $A_1$
is happy, because $Eu \leftrightarrow E(SSu)$ since both $u$ and $SSu$ are even. Axiom $A_2$
is happy because $0 \ne Su$. Axiom $A_3$ is happy because $Sx = Sy \leftrightarrow x = y$
whenever $x$ and $y$ are both natural or both $u$, and also if one is natural and
one is $u$ (because in this case $Sx \ne Sy$ and $x \ne y$).

But: with $u$ in the model, we have an object for which $Eu$ and $E(Su)$
are both true, contradicting the claim! So if we want the successor to any
even number to be odd, we are going to need a bigger set of axioms.
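
The model with the extra element $u$ can be explored mechanically on a truncated domain. In this sketch the cutoff `N`, the string `"u"` used to represent the unnatural number, and all function names are assumptions made for illustration; because the real model is infinite, the quantifiers below range only over the listed elements, so this is a plausibility check rather than a proof.

```python
N = 20                               # hypothetical cutoff; the real model is infinite
U = "u"                              # the extra "unnatural" element
DOMAIN = list(range(N + 1)) + [U]

def S(x):
    """Successor: ordinary on naturals, and Su = u."""
    return U if x == U else x + 1

def E(x):
    """Evenness in the model: ordinary parity on naturals, and Eu holds."""
    return True if x == U else x % 2 == 0

def a1_holds(x):
    # A1: Ex <-> (x = 0 or exists y: Ey and x = SSy)
    rhs = x == 0 or any(E(y) and x == S(S(y)) for y in DOMAIN)
    return E(x) == rhs

def a2_holds(x):
    # A2: 0 != Sx
    return S(x) != 0

def a3_holds(x, y):
    # A3: Sx = Sy -> x = y
    return S(x) != S(y) or x == y

axioms_ok = (
    all(a1_holds(x) for x in DOMAIN)
    and all(a2_holds(x) for x in DOMAIN)
    and all(a3_holds(x, y) for x in DOMAIN for y in DOMAIN)
)

def claim_holds(x):
    # Claim 2.6.3: Ex <-> not E(Sx)
    return E(x) == (not E(S(x)))

claim_on_naturals = all(claim_holds(x) for x in range(N + 1))
claim_at_u = claim_holds(U)

print(axioms_ok, claim_on_naturals, claim_at_u)
```

The axioms hold on every listed element, the claim holds on the ordinary naturals, and it fails at $u$, matching the argument in the text.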

What we are really missing here is the Axiom Schema of Induction, 
which says that if $P(0)$ and $\forall x: P(x) \rightarrow P(Sx)$, then $\forall x: P(x)$. Note
that throwing in the Axiom Schema of Induction actually requires adding 
infinitely many axioms, since we get a distinct axiom for each choice of 
formula P. 


Chapter 3 


Set theory 


Set theory is the dominant foundation for mathematics. The idea is that 
everything else in mathematics—numbers, functions, etc.—can be written in 
terms of sets, so that if you have a consistent description of how sets behave, 
then you have a consistent description of how everything built on top of 
them behaves. If predicate logic is the machine code of mathematics, set 
theory would be assembly language. 

The nice thing about set theory is that it requires only one additional
predicate on top of the standard machinery of predicate logic. This is the
membership or element predicate $\in$, where $x \in S$ means that $x$ is an
element of $S$. Here $S$ is a set (a collection of elements), and the identity of
$S$ is completely determined by which $x$ satisfy $x \in S$. Every other predicate
in set theory can be defined in terms of $\in$.

We'll describe two versions of set theory below. The first, naive set 
theory, treats any plausible collection of elements as a set. This turns out 
to produce some unfortunate paradoxes, so most mathematics is built on a 
more sophisticated foundation known as axiomatic set theory. Here we 
can only use those sets whose existence we can prove using a standard list of 
axioms. But the axioms are chosen so that all the normal things we might 
want to do with sets in naive set theory are explicitly possible. 


3.1 Naive set theory 


Naive set theory is the informal version of set theory that corresponds to 
our intuitions about sets as unordered collections of objects (called elements) 
with no duplicates. An element of a set may also be a set (in which case it
contains its own elements), or it may just be some object that is not a set
(also known as an urelement, which is German for “primitive element”).
A set can be written explicitly by listing its elements using curly braces:


• $\{\}$ = the empty set $\emptyset$, which has no elements.


• $\{\text{Moe}, \text{Curly}, \text{Larry}\}$ = the Three Stooges.


• $\{0, 1, 2, \dots\} = \mathbb{N}$, the natural numbers. Note that we are relying on
the reader guessing correctly how to continue the sequence here.


• $\{\{\}, \{0\}, \{1\}, \{0, 1\}, \{0, 1, 2\}, 7\}$ = a set of sets of natural numbers,
plus a stray natural number that is directly an element of the outer
set.


Membership in a set is written using the $\in$ symbol (pronounced “is an
element of,” “is a member of,” or just “is in”). So we can write Moe $\in$ the
Three Stooges or $4 \in \mathbb{N}$. We can also write $\notin$ for “is not an element of,” as
in Moe $\notin \mathbb{N}$, and the reversed symbol $\ni$ for “has as an element,” as in $\mathbb{N} \ni 4$.

A fundamental axiom in set theory (the Axiom of Extensionality; see 
§3.4) is that the only distinguishing property of a set is its list of members: 
if two sets have the same members, they are the same set. 

For nested sets like $\{\{1\}\}$, $\in$ represents only direct membership: the
set $\{\{1\}\}$ only has one element, $\{1\}$, so $1 \notin \{\{1\}\}$. This can be
confusing if you think of $\in$ as representing the English “is in,” because if
I put my lunch in my lunchbox and put my lunchbox in my backpack,
then my lunch is in my backpack. But my lunch is not an element of
$\{\{\text{my lunch}\}, \text{my textbook}, \text{my slingshot}\}$. In general, $\in$ is not transitive
(see §9.3): it doesn’t behave like $<$ unless there is something very unusual
about the set you are applying it to. There is also no standard notation for
being a deeply-buried element of an element of an element (etc.) of some set.

In addition to listing the elements of a set explicitly, we can also define 
a set by set comprehension, where we give a rule for how to generate 
all of its elements. This is pretty much the only way to define an infinite 
set without relying on guessing, but can be used for sets of any size. Set 
comprehension is usually written using set-builder notation, as in the 
following examples: 


• $\{x \mid x \in \mathbb{N} \land x > 1 \land (\forall y \in \mathbb{N}: \forall z \in \mathbb{N}: yz = x \rightarrow y = 1 \lor z = 1)\}$ = the
prime numbers.


• $\{2x \mid x \in \mathbb{N}\}$ = the even numbers.


• $\{x \mid x \in \mathbb{N} \land x < 12\} = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11\}$.


$\{ x \mid 0 \le x \le 100, x \equiv 1 \pmod 2 \}$
[ x | x <- [0..100], x `mod` 2 == 1 ]
[ x for x in range(0,101) if x % 2 == 1 ]

Table 3.1: Set comprehension vs list comprehension. The first line gives the 
set of odd numbers between 0 and 100 written using set-builder notation. 
The other lines construct the odd numbers between 0 and 100 as ordered list 
data structures in Haskell and Python respectively. 


Some very high-level programming languages like Haskell or Python have 
a similar mechanism called list comprehension which does pretty much 
the same thing except the result is an ordered list. Table 3.1 gives some 
examples of what this looks like. 
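
Python also has set comprehensions, which are closer to set-builder notation than list comprehensions because the result is unordered and has no duplicates. The sketch below transcribes the three set-builder examples given earlier; since a Python set must be finite, a hypothetical bound `LIMIT` replaces the natural numbers, and the variable names are invented for this illustration.

```python
LIMIT = 50                 # hypothetical bound: Python sets must be finite, unlike N
naturals = range(LIMIT)

# {x | x in N and x > 1 and (for all y, z in N: yz = x -> y = 1 or z = 1)}
primes = {
    x for x in naturals
    if x > 1 and all(y == 1 or z == 1
                     for y in naturals for z in naturals if y * z == x)
}

# {2x | x in N}
evens = {2 * x for x in naturals}

# {x | x in N and x < 12}
below_12 = {x for x in naturals if x < 12}

# Duplicates collapse: only three distinct remainders survive.
remainders = {x % 3 for x in naturals}

print(sorted(primes))
print(sorted(below_12))
print(remainders == {0, 1, 2})
```

The prime-number comprehension follows the logical definition literally, so it is far slower than a sieve; it is meant to show the correspondence with the notation, not to be an efficient algorithm.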

Sometimes the original set that an element has to be drawn from is put 
on the left-hand side of the vertical bar: 


• $\{n \in \mathbb{N} \mid \exists x, y, z \in \mathbb{N} \setminus \{0\}: x^n + y^n = z^n\}$. This is a fancy name for
$\{1, 2\}$, but this fact is not obvious [Wil95].


Using set comprehension, we can see that every set in naive set theory
is equivalent to some predicate. Given a set $S$, the corresponding predicate
is $x \in S$, and given a predicate $P$, the corresponding set is $\{x \mid Px\}$. But
watch out for Russell’s paradox: what is $\{S \mid S \notin S\}$?


3.2 Operations on sets 


If we think of sets as representing predicates, each logical connective gives 
rise to a corresponding operation on sets: 


• $A \cup B = \{x \mid x \in A \lor x \in B\}$. The union of $A$ and $B$.


• $A \cap B = \{x \mid x \in A \land x \in B\}$. The intersection of $A$ and $B$.


• $A \setminus B = \{x \mid x \in A \land x \notin B\}$. The set difference of $A$ and $B$.


• $A \triangle B = \{x \mid x \in A \oplus x \in B\}$. The symmetric difference of $A$ and
$B$.


(Of these, union and intersection are the most important in practice.)
Corresponding to implication is the notion of a subset:


• $A \subseteq B$ (“$A$ is a subset of $B$”) if and only if $\forall x: x \in A \rightarrow x \in B$.

As with $\in$, $\subseteq$ can be reversed: $A \supseteq B$ means that $A$ is a superset of $B$,
which is the same as saying $B \subseteq A$. We can also write $A \not\subseteq B$ to say that $A$
is not a subset of $B$, and the rather awkward-looking $A \subsetneq B$ to say that $A$
is a proper subset of $B$, meaning that $A \subseteq B$ but $A \ne B$. (The standard
version $A \subseteq B$ allows the case $A = B$.)

Sometimes one says $A$ is contained in $B$ if $A \subseteq B$. This is one of two
senses in which $A$ can be “in” $B$; it is also possible that $A$ is in fact an
element of $B$ ($A \in B$). For example, the set $A = \{12\}$ is an element of the
set $B = \{\text{Moe}, \text{Larry}, \text{Curly}, \{12\}\}$, but $A$ is not a subset of $B$, because $A$’s
element 12 is not an element of $B$. Usually we will try to reserve “is in” for
$\in$ and “is contained in” for $\subseteq$, but it’s safest to use the symbols (or “is an
element/subset of”) to avoid any possibility of ambiguity.
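
Python's built-in set type implements these operations directly, which gives a quick way to experiment with them on small finite sets. The sketch below uses example sets `A` and `B` chosen for illustration, and uses `frozenset` for the inner set because Python requires set elements to be hashable.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

union = A | B                    # elements in A or in B
intersection = A & B             # elements in both A and B
difference = A - B               # elements in A but not in B
symmetric_difference = A ^ B     # elements in exactly one of A and B

# Each operator agrees with its set-builder definition over a small universe.
universe = range(10)
union_by_definition = {x for x in universe if x in A or x in B}
symdiff_by_definition = {x for x in universe if (x in A) != (x in B)}

is_subset = {3, 4} <= A          # subset
is_proper_subset = {3, 4} < A    # proper subset
a_proper_subset_of_itself = A < A

# Element versus subset: {12} is an element of the outer set, not a subset of it.
inner = frozenset({12})
outer = {"Moe", "Larry", "Curly", inner}
inner_is_element = inner in outer
inner_is_subset = inner <= outer

print(union, intersection, difference, symmetric_difference)
print(is_subset, is_proper_subset, a_proper_subset_of_itself)
print(inner_is_element, inner_is_subset)
```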

Finally we have the set-theoretic equivalent of negation: 


• $\bar{A} = \{x \mid x \notin A\}$. The set $\bar{A}$ is known as the complement of $A$.


If we allow complements, we are necessarily working inside some fixed
universe, since the complement $U = \bar{\emptyset}$ of the empty set contains all possible
objects. This raises the issue of where the universe comes from. One approach
is to assume that we’ve already fixed some universe that we understand (e.g.
$\mathbb{N}$), but then we run into trouble if we want to work with different classes
of objects at the same time. The set theory used in most of mathematics is
defined by a collection of axioms that allow us to construct, essentially from
scratch, a universe big enough to hold all of mathematics without apparent
contradictions while avoiding the paradoxes that may arise in naive set
theory. However, one consequence of this construction is that the universe
is (a) much bigger than anything we might ever use, and (b) not a set,
making complements not very useful. The usual solution to this is to replace
complements with explicit set differences: $U \setminus A$ for some specific universe $U$
instead of $\bar{A}$.


3.3 Proving things about sets


We have three predicates so far in set theory, so there are essentially three 
positive things we could try to prove about sets: 


1. Given x and S, show x ∈ S. This requires looking at the definition of
S to see if x satisfies its requirements, and the exact structure of the
proof will depend on what the definition of S is.


2. Given S and T, show S ⊆ T. Expanding the definition of subset, this
means we have to show that every x in S is also in T. So a typical
proof will pick an arbitrary x in S and show that it must also be an
element of T. This will involve unpacking the definition of S and using
its properties to show that x satisfies the definition of T.


3. Given S and T, show S = T. Typically we do this by showing S ⊆ T
and T ⊆ S separately. The first shows that ∀x : x ∈ S → x ∈ T; the
second shows that ∀x : x ∈ T → x ∈ S. Together, x ∈ S → x ∈ T
and x ∈ T → x ∈ S give x ∈ S ↔ x ∈ T, which is what we need for
equality.


There are also the corresponding negative statements: 


1. For x ∉ S, use the definition of S as before.


2. For S ⊈ T, we only need a counterexample: find some element of S
and show that it’s not an element of T.


3. For S ≠ T, prove one of S ⊈ T or T ⊈ S.


Note that because S ⊈ T and S ≠ T are existential statements rather
than universal ones, they tend to have simpler proofs.
Here are some examples, which we’ll package up as a lemma: 


Lemma 3.3.1. The following statements hold for all sets S and T, and all 
predicates P: 


S ⊇ S ∩ T (3.3.1)
S ⊆ S ∪ T (3.3.2)
S ⊇ {x ∈ S | P(x)} (3.3.3)
S = (S ∩ T) ∪ (S \ T) (3.3.4)


Proof. • (3.3.1) Let x be in S ∩ T. Then x ∈ S and x ∈ T, from the
definition of S ∩ T. It follows that x ∈ S. Since x was arbitrary, we
have that for all x in S ∩ T, x is also in S; in other words, S ∩ T ⊆ S.


• (3.3.2) Let x be in S. Then x ∈ S ∨ x ∈ T is true, giving x ∈ S ∪ T.


• (3.3.3) Let x be in {x ∈ S | P(x)}. Then, by the definition of set
comprehension, x ∈ S and P(x). We don’t care about P(x), so we
drop it to just get x ∈ S.


• (3.3.4) This is a little messy, but we can solve it by breaking it down
into smaller problems.


First, we show that S ⊆ (S \ T) ∪ (S ∩ T). Let x be an element of S.
There are two cases:

1. If x ∈ T, then x ∈ (S ∩ T).

2. If x ∉ T, then x ∈ (S \ T).

In either case, we have shown that x is in (S ∩ T) ∪ (S \ T). This gives
S ⊆ (S ∩ T) ∪ (S \ T).

Conversely, we show that (S \ T) ∪ (S ∩ T) ⊆ S. Suppose that x ∈
(S \ T) ∪ (S ∩ T). Again we have two cases:

1. If x ∈ (S \ T), then x ∈ S and x ∉ T.

2. If x ∈ (S ∩ T), then x ∈ S and x ∈ T.


In either case, x ∈ S.


Since we’ve shown that both the left-hand and right-hand sides of
(3.3.4) are subsets of each other, they must be equal.


Using similar arguments, we can show that properties of ∧ and ∨ that
don’t involve negation carry over to ∩ and ∪ in the obvious way. For example,
both operations are commutative and associative, and each distributes over
the other.
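A finite check is no substitute for a proof, but it is a cheap way to catch a misstated identity. The sketch below tests the four statements of Lemma 3.3.1 and the two distributive laws over every subset of a four-element universe; the helper names and the sample predicate are assumptions chosen for illustration.

```python
from itertools import chain, combinations

def powerset(universe):
    """Return every subset of a finite collection as a frozenset."""
    items = list(universe)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

def lemma_holds(S, T, P):
    """Check the four statements of Lemma 3.3.1 for one choice of S, T, P."""
    return (S >= S & T
            and S <= S | T
            and S >= {x for x in S if P(x)}
            and S == (S & T) | (S - T))

def distributes(A, B, C):
    """Check that each of union and intersection distributes over the other."""
    return (A & (B | C) == (A & B) | (A & C)
            and A | (B & C) == (A | B) & (A | C))

subsets = powerset(range(4))
is_even = lambda x: x % 2 == 0   # an arbitrary sample predicate

all_lemma = all(lemma_holds(S, T, is_even) for S in subsets for T in subsets)
all_distrib = all(distributes(A, B, C)
                  for A in subsets for B in subsets for C in subsets)
```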


3.4 Axiomatic set theory 


The problem with naive set theory is that unrestricted set comprehension 
is too strong, leading to contradictions. Axiomatic set theory fixes this 
problem by being more restrictive about what sets one can form. The axioms 
most commonly used are known as Zermelo-Fraenkel set theory with 
choice or ZFC. We'll describe the axioms of ZFC below, but in practice 
you mostly just need to know what constructions you can get away with. 
The short version is that you can construct sets by (a) listing their
members, (b) taking the union of other sets, (c) taking the set of all subsets
of a set, or (d) using some predicate to pick out elements or subsets of some
set.¹ The starting points for this process are the empty set ∅ and the set ℕ
of all natural numbers (suitably encoded as sets). If you can’t construct a
set in this way (like the Russell’s Paradox set), odds are that it isn’t a set.
These properties follow from the more useful axioms of ZFC: 


¹Technically this only gives us Z, a weaker set theory than ZFC that omits Replacement
(Fraenkel’s contribution) and Choice.
Extensionality Any two sets with the same elements are equal.²


Existence The empty set ∅ is a set.³


Pairing Given sets x and y, {x, y} is a set.⁴


Union For any set of sets S = {x, y, z, ...}, the set ⋃S = x ∪ y ∪ z ∪ ...
exists.⁵


Power set For any set S, the power set P(S) = {A | A ⊆ S} exists.⁶


Specification For any set S and any predicate P, the set {x ∈ S | P(x)}
exists.⁷ This is called restricted comprehension, and is an axiom
schema instead of an axiom, since it generates an infinite list of axioms,
one for each possible P. Limiting ourselves to constructing subsets
of existing sets avoids Russell’s Paradox, because we can’t construct
S = {x | x ∉ x}. Instead, we can try to construct S = {x ∈ T | x ∉ x},
but we’ll find that S isn’t an element of T, so it doesn’t contain itself
but also doesn’t create a contradiction.


Infinity There is a set that has ∅ as a member and also has x ∪ {x}
whenever it has x.⁸ This gives an encoding of ℕ where ∅ represents
0 and x ∪ {x} represents x + 1. Expanding out the x + 1 rule shows
that each number is represented by the set of all smaller numbers, e.g.
3 = {0, 1, 2} = {∅, {∅}, {∅, {∅}}}, which has the nice property that
each number n is represented by a set with exactly n elements, and
that a < b can be represented by a ∈ b.⁹


Without this axiom, we only get finite sets. 


(Technical note: the set whose existence is given by the Axiom of 
Infinity may also contain some extra elements outside of N, but we can 
strip them out—with some effort—using Specification. ) 
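The encoding of natural numbers given by the Axiom of Infinity is easy to experiment with. The sketch below builds the first few encoded numbers out of nested frozensets and checks the two properties claimed above; the function names are hypothetical and chosen only for this illustration.

```python
def successor(x):
    """Return x ∪ {x} for a frozenset x."""
    return x | frozenset({x})

def von_neumann(n):
    """Build the set that encodes the natural number n."""
    x = frozenset()          # the empty set encodes 0
    for _ in range(n):
        x = successor(x)
    return x

three = von_neumann(3)

# Each n is encoded by a set with exactly n elements.
sizes_match = all(len(von_neumann(n)) == n for n in range(6))

# a < b holds exactly when the encoding of a is an element of the encoding of b.
order_is_membership = all(
    (a < b) == (von_neumann(a) in von_neumann(b))
    for a in range(6) for b in range(6))
```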


²∀x : ∀y : (x = y) ↔ (∀z : z ∈ x ↔ z ∈ y).
³∃x : ∀y : y ∉ x.
⁴∀x : ∀y : ∃z : ∀q : q ∈ z ↔ q = x ∨ q = y.
⁵∀x : ∃y : ∀z : z ∈ y ↔ (∃q : z ∈ q ∧ q ∈ x).
⁶∀x : ∃y : ∀z : z ∈ y ↔ z ⊆ x.
⁷∀x : ∃y : ∀z : z ∈ y ↔ z ∈ x ∧ P(z).
⁸∃x : ∅ ∈ x ∧ ∀y ∈ x : y ∪ {y} ∈ x.

⁹Natural numbers represented in this way are called finite von Neumann ordinals.
These are a special case of the von Neumann ordinals, discussed in §3.5.5.4, which can
also represent values that are not finite.
There are three other axioms that don’t come up much in computer
science:


Foundation Every nonempty set A contains a set B with A ∩ B = ∅.¹⁰ This
rather technical axiom prevents various weird sets, such as sets that
contain themselves or infinite descending chains A₀ ∋ A₁ ∋ A₂ ∋ ....
Without it, we can’t do induction arguments¹¹ once we get beyond ℕ.


Replacement If S is a set, and R(x, y) is a predicate with the property
that ∀x : ∃!y : R(x, y), then {y | ∃x ∈ S : R(x, y)} is a set.¹² Like
comprehension, replacement is an axiom schema. Mostly used to
construct astonishingly huge infinite sets.


Choice For any set of nonempty sets S there is a function f that assigns
to each x in S some f(x) ∈ x. This axiom is unpopular in some
circles because it is non-constructive: it tells you that f exists, but
it doesn’t give an actual definition of f. But it’s too useful to throw
out.


Like everything else in mathematics, the particular system of axioms 
we ended up with is a function of the history, and there are other axioms 
that could have been included but weren’t. Some of the practical reasons 
for including some axioms but not others are described in a pair of classic 
papers by Maddy [Mad88a, Mad88b]. 


3.5 Cartesian products, relations, and functions 


Sets are unordered: the set {a,b} is the same as the set {b,a}. Sometimes it 
is useful to consider ordered pairs (a,b), where we can tell which element 
comes first and which comes second. These can be encoded as sets using the 
rule (a, b) = {{a}, {a, b}}, which was first proposed by Kuratowski [Kur21,
Definition V].¹³


¹⁰∀x ≠ ∅ : ∃y ∈ x : x ∩ y = ∅.

¹¹See Chapter 5.

¹²(∀x : ∃!y : R(x, y)) → ∀z : ∃q : ∀r : r ∈ q ↔ (∃s ∈ z : R(s, r)).

¹³This was not the only possible choice. Kuratowski cites a previous encoding suggested
by Hausdorff [Hau14] of (a, b) as {{a, 1}, {b, 2}}, where 1 and 2 are tags not equal to a or
b. He argues that this definition “seems less convenient to me” than {{a}, {a, b}}, because
it requires tinkering with the definition if a or b turn out to be equal to 1 or 2. This is a
nice example of how even though mathematical definitions arise through convention, some
definitions are easier to use than others.
Given sets A and B, their Cartesian product A × B is the set
{(x, y) | x ∈ A ∧ y ∈ B}, or in other words the set of all ordered pairs that
can be constructed by taking the first element from A and the second from B.
If A has n elements and B has m, then A × B has nm elements.¹⁴ For example,
{1, 2} × {3, 4} = {(1, 3), (1, 4), (2, 3), (2, 4)}.

Because of the ordering, Cartesian product is not commutative in general.
We usually have A × B ≠ B × A. (Exercise: when are they equal?)

The existence of the Cartesian product of any two sets can be proved
using the axioms we already have: if (x, y) is defined as {{x}, {x, y}}, then
P(A ∪ B) contains all the necessary sets {x} and {x, y}, and P(P(A ∪ B))
contains all the pairs {{x}, {x, y}}. It also contains a lot of other sets we
don’t want, but we can get rid of them using Specification.
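For very small sets this construction can be carried out literally. The following sketch, with hypothetical helper names, encodes pairs in Kuratowski's style as nested frozensets, builds P(P(A ∪ B)), and then filters it with a predicate in the manner of Specification. It is meant only as an illustration: the double power set grows far too quickly for this to be a practical way to compute products.

```python
from itertools import chain, combinations

def powerset(s):
    """All subsets of a finite set, as a frozenset of frozensets."""
    items = list(s)
    return frozenset(frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))

def kpair(a, b):
    """Kuratowski encoding (a, b) = {{a}, {a, b}}."""
    return frozenset({frozenset({a}), frozenset({a, b})})

def cartesian_product(A, B):
    """Carve A × B out of P(P(A ∪ B)) by filtering with a predicate."""
    candidates = powerset(powerset(A | B))
    return frozenset(
        s for s in candidates
        if any(s == kpair(a, b) for a in A for b in B))

A = frozenset({1, 2})
B = frozenset({3})
product = cartesian_product(A, B)
```

Note that `kpair(1, 2)` and `kpair(2, 1)` are different sets, which is the whole point of the encoding.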

A special class of relations are functions. A function from a domain
A to a codomain¹⁵ B is a relation on A and B (i.e., a subset of A × B)
such that every element of A appears on the left-hand side of exactly one
ordered pair. We write f : A → B as a short way of saying that f is a
function from A to B, and for each x ∈ A write f(x) for the unique y ∈ B
with (x, y) ∈ f.¹⁶

The set of all functions from A to B is written as B^A: note that the order
of A and B is backwards here from A → B. Since this is just the subset of
P(A × B) consisting of functions as opposed to more general relations, it
exists by the Power Set and Specification axioms.

When the domain of a function is finite, we can always write down a 
list of all its values. For infinite domains (e.g. ℕ), almost all functions are
impossible to write down, either as an explicit table (which would need to be 
infinitely long) or as a formula (there aren’t enough formulas). Most of the 
time we will be interested in functions that have enough structure that we 
can describe them succinctly, for obvious practical reasons. But in a sense 
these other, ineffable functions still exist, so we use a definition of a function 
that encompasses them. 

Often, a function is specified not by writing out some huge set of ordered
pairs, but by giving a rule for computing f(x). An example: f(x) = x².


¹⁴In fact, this is the most direct way to define multiplication on ℕ, and pretty much the
only sensible way to define multiplication for infinite cardinalities; see §11.1.5.

¹⁵The codomain is sometimes called the range, but most mathematicians will use range
for {f(x) | x ∈ A}, which may or may not be equal to the codomain B, depending on
whether f is or is not surjective.

¹⁶Technically, knowing f alone does not tell you what the codomain is, since some
elements of B may not show up at all. This can be fixed by representing a function as a
pair (f, B), but it’s not something most people worry about.
Particular trivial functions can be defined in this way anonymously; another
way to write f(x) = x² is as the anonymous function x ↦ x².


3.5.1 Examples of functions 


• f(x) = x². Note: this single rule gives several different functions, e.g.
f : ℝ → ℝ, f : ℤ → ℤ, f : ℕ → ℕ, f : ℤ → ℕ. Changing the domain
or codomain changes the function.


• f(x) = x + 1.


• Floor and ceiling functions: when x is a real number, the floor of x
(usually written ⌊x⌋) is the largest integer less than or equal to x and
the ceiling of x (usually written ⌈x⌉) is the smallest integer greater
than or equal to x. E.g., ⌊2⌋ = ⌈2⌉ = 2, ⌊2.337⌋ = 2, ⌈2.337⌉ = 3.


• The function from {0, 1, 2, 3, 4} to {a, b, c} given by the following table:


0 a
1 c
2 b
3 a
4 b
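To make the "function as a set of ordered pairs" view concrete, here is a small sketch that stores the table above as a relation and checks the defining property of a function. The name `is_function` is an assumption made for this example; the floor and ceiling values use Python's standard `math` module.

```python
import math

def is_function(relation, domain, codomain):
    """True if relation ⊆ domain × codomain pairs each element of the
    domain with exactly one element of the codomain."""
    if any(x not in domain or y not in codomain for x, y in relation):
        return False
    return all(sum(1 for x, _ in relation if x == a) == 1 for a in domain)

domain = {0, 1, 2, 3, 4}
codomain = {"a", "b", "c"}
table = {(0, "a"), (1, "c"), (2, "b"), (3, "a"), (4, "b")}

table_ok = is_function(table, domain, codomain)

# Adding a second pair for 0 breaks the "exactly one" requirement.
extra_pair_ok = is_function(table | {(0, "b")}, domain, codomain)

floor_and_ceiling = (math.floor(2.337), math.ceil(2.337))
```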


3.5.2 Sequences 


Functions let us define sequences of arbitrary length: for example, the infinite
sequence x₀, x₁, x₂, ... of elements of some set A is represented by a function
x : ℕ → A, while a shorter sequence (a₀, a₁, a₂) would be represented by
a function a : {0, 1, 2} → A. In both cases the subscript takes the place
of a function argument: we treat xₙ as syntactic sugar for x(n). Finite
sequences are often called tuples, and we think of the result of taking
the Cartesian product of a finite number of sets A × B × C as a set of
tuples (a, b, c), even though the actual structure may be ((a, b), c) or (a, (b, c))
depending on which product operation we do first.

We can think of the Cartesian product of k sets (where k need not be 2)
as a set of sequences indexed by the set {1 ... k} (or sometimes {0 ... k − 1}).
Technically this means that A × B × C (the set of functions from {1, 2, 3} to
A ∪ B ∪ C with the property that for each function f ∈ A × B × C, f(1) ∈ A,
f(2) ∈ B, and f(3) ∈ C) is not the same as (A × B) × C (the set of all
ordered pairs whose first element is an ordered pair in A × B and whose
second element is in C) or A × (B × C) (the set of ordered pairs whose first
element is in A and whose second element is in B × C). This distinction has
no practical effect and so we typically ignore it; the technical justification
for this is that the three different representations are all isomorphic in the
sense that a translation exists between each pair of them that preserves their
structure.

A special case is the Cartesian product of no sets. This is just the set 
containing a single element, the empty sequence. 

Cartesian products over indexed collections of sets can be written using
product notation (see §6.2), as in

∏_{i=1}^{n} A_i

or even

∏_{x∈ℝ} A_x.


3.5.3 Functions of more (or less) than one argument 


If f : A × B → C, then we write f(a, b) for f((a, b)). In general we can
have a function with any number of arguments (including 0); a function of k
arguments is just a function from a domain of the form A₁ × A₂ × ... × A_k to
some codomain B.


3.5.4 Composition of functions 


Two functions f : A → B and g : B → C can be composed to give a
composition g ∘ f. This is a function from A to C defined by (g ∘ f)(x) =
g(f(x)). Composition is often implicit in definitions of functions: the function
x ↦ x² + 1 is the composition of two functions x ↦ x + 1 and x ↦ x².


3.5.5 Functions with special properties 


We can classify functions f : A → B based on how many elements x of the
domain A get mapped to each element y of the codomain B. If every y is
the image of at least one x, f is surjective. If every y is the image of at
most one x, f is injective. If every y is the image of exactly one x, f is
bijective.¹⁷ These concepts are formalized below.


¹⁷These terms, which are generally attributed to the group of mathematicians who
published under the name Bourbaki [Bou70], are now pretty well established and have the
advantage of being hard to confuse with each other. An older convention in English was
3.5.5.1 Surjections


A function f : A → B that covers every element of B is called onto,
surjective, or a surjection. This means that for any y in B, there exists
some x in A such that y = f(x). An equivalent way to show that a function is
surjective is to show that its range {f(x) | x ∈ A} is equal to its codomain.

For example, the function f(x) = x² from ℕ to ℕ is not surjective,
because its range includes only perfect squares. The function f(x) = x + 1
from ℕ to ℕ is not surjective because its range doesn’t include 0. However,
the function f(x) = x + 1 from ℤ to ℤ is surjective, because for every y in ℤ
there is some x in ℤ such that y = x + 1.


3.5.5.2 Injections 


If f : A → B maps distinct elements of A to distinct elements of B (i.e.,
if x ≠ y implies f(x) ≠ f(y)), it is called one-to-one, injective, or an
injection. By contraposition, an equivalent definition is that f(x) = f(y)
implies x = y for all x and y in the domain. For example, the function
f(x) = x² from ℕ to ℕ is injective. The function f(x) = x² from ℤ to ℤ is
not injective (for example, f(−1) = f(1) = 1). The function f(x) = x + 1
from ℕ to ℕ is injective.


3.5.5.3 Bijections 


A function that is both surjective and injective is called a one-to-one
correspondence, bijective, or a bijection. Any bijection f has an inverse
function f⁻¹; this is the function {(y, x) | (x, y) ∈ f}.

Of the functions we have been using as examples, only f(x) = x + 1 from
ℤ to ℤ is bijective.
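These properties can be tested mechanically when the domain is finite. The examples above live on ℕ and ℤ, which a program cannot enumerate, so the sketch below substitutes a finite stand-in: the numbers 0 through 4 with arithmetic taken modulo 5. That substitution, and all of the names used, are assumptions made purely for illustration.

```python
def is_surjective(f, codomain):
    """Every element of the codomain is the image of at least one x."""
    return set(f.values()) == set(codomain)

def is_injective(f):
    """No two domain elements share an image."""
    return len(set(f.values())) == len(f)

def inverse(f, codomain):
    """Return {(y, x) | (x, y) in f}, which is a function only for bijections."""
    if not (is_injective(f) and is_surjective(f, codomain)):
        raise ValueError("only bijections have inverse functions")
    return {y: x for x, y in f.items()}

n = 5
domain = range(n)
square = {x: (x * x) % n for x in domain}   # finite analogue of x ↦ x²
shift = {x: (x + 1) % n for x in domain}    # finite analogue of x ↦ x + 1

shift_inverse = inverse(shift, domain)
```

Functions are stored as dictionaries, so the domain is implicit in the keys while the codomain has to be supplied separately, much as footnote 16 observes.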


3.5.5.4 Bijections and counting 


Bijections let us define the size of arbitrary sets without having some special
means to count elements. We say two sets A and B have the same size or
cardinality if there exists a bijection f : A ↔ B.

Often it is convenient to have standard representatives of sets of a given 
cardinality. A common trick is to use the von Neumann ordinals, which 
are sets that are constructed recursively so that each contains all the smaller 


to call surjective functions onto, injective functions one-to-one, and bijective functions
one-to-one correspondences. This can lead to confusion between injective and bijective
functions, so we’ll stick with the less confusing French-derived terminology.
ordinals as elements.¹⁸ The empty set ∅ represents 0, the set {0} represents
1, {0, 1} represents 2, and so on. The first infinite ordinal is ω = {0, 1, 2, ...},
which is followed by ω + 1 = {0, 1, 2, ...; ω}, ω + 2 = {0, 1, 2, ...; ω, ω + 1},
and so forth; there are also much bigger ordinals like ω² (which looks like
ω many copies of ω stuck together), ω^ω (which is harder to describe, but
can be visualized as the set of infinite sequences of natural numbers with an
appropriate ordering), and so on. Given any collection of ordinals, it has a
smallest element, equal to the intersection of all elements: this means that
von Neumann ordinals are well-ordered (see §9.5.6). So we can define the
cardinality |A| of a set A formally as the unique smallest ordinal B such that
there exists a bijection f : A ↔ B.

This is exactly what we do when we do counting: to know that there
are 3 stooges, we count them off 0 → Moe, 1 → Larry, 2 → Curly, giving a
bijection between the set of stooges and 3 = {0, 1, 2}.

Because different infinite ordinals may have the same cardinality, infinite
cardinalities are generally not named for the smallest ordinal of that
cardinality, but get their own names. So the cardinality |ℕ| of the naturals
is written as ℵ₀, the next largest possible cardinality as ℵ₁, etc. See §3.7.1
for more details.


3.6 Constructing the universe 


With power set, Cartesian product, the notion of a sequence, etc., we can 
construct all of the standard objects of mathematics. For example: 


Integers The integers are the set ℤ = {..., −2, −1, 0, 1, 2, ...}. We
represent each integer z as an ordered pair (x, y), where x = 0 ∨ y = 0;
formally, ℤ = {(x, y) ∈ ℕ × ℕ | x = 0 ∨ y = 0}. The interpretation of
(x, y) is x − y; so positive integers z are represented as (z, 0) while
negative integers z are represented as (0, −z). It’s not hard to define
addition, subtraction, multiplication, etc. using this representation.


¹⁸The formal definition is that S is an ordinal if (a) every element of S is also a subset
of S; and (b) every subset T of S contains an element x with the property that x = y or
x ∈ y for all y ∈ T. In other words, every subset T of S has a minimal element with
respect to ∈. If we treat ∈ as <, this property makes S well-ordered (see §9.5.6). The
fact that every subset of S has a minimal element means that we can do induction on
S, since if there is some property that does not hold for all x in S, there must be some
minimal x for which it doesn’t hold. So if we can prove that ∀y < x : P(y) implies P(x),
then it must be the case that P holds for every element of S, because otherwise we get a
contradiction at the minimal x for which P does not hold.
Rationals The rational numbers ℚ are all fractions of the form p/q where
p is an integer, q is a natural number not equal to 0, and p and q have
no common factors. Each such fraction can be represented as a set
using an ordered pair (p, q). Operations on rationals are defined as you
may remember from grade school.


Reals The real numbers ℝ can be defined in a number of ways, all of which
turn out to be equivalent. The simplest to describe is that a real number
x is represented by a pair of sets {y ∈ ℚ | y < x} and {y ∈ ℚ | y ≥ x};
this is known as a Dedekind cut [Ded01]. Formally, a Dedekind cut
is any pair of subsets (S, T) of ℚ with the properties that (a) S and
T partition ℚ, meaning that S ∩ T = ∅ and S ∪ T = ℚ; (b) every
element of S is less than every element of T (∀s ∈ S ∀t ∈ T : s < t);
and (c) S contains no largest element (∀x ∈ S ∃y ∈ S : x < y). Note
that real numbers in this representation may be hard to write down.


A simpler but equivalent representation is to drop T, since it is just
ℚ \ S: this gives us a real number for any proper subset S of ℚ that has
no largest element and is downward closed, meaning that x < y ∈ S
implies x ∈ S. Real numbers in this representation may still be hard
to write down.


More conventionally, a real number can be written as an infinite decimal
expansion like


π ≈ 3.14159265358979323846264338327950288419716939937510582...


which is a special case of a Cauchy sequence that gives increasingly
good approximations to the actual real number the further along you
go.
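The claim that arithmetic on the pair representation of the integers is "not hard to define" can be made concrete. The sketch below is one possible way to do it, not necessarily the definition the text has in mind: it treats a pair (x, y) of naturals as standing for x − y and renormalizes after each operation so that one coordinate is always 0. All function names are hypothetical.

```python
def normalize(x, y):
    """Reduce a pair of naturals meaning x - y to the form with x == 0 or y == 0."""
    m = min(x, y)
    return (x - m, y - m)

def add(a, b):
    """(x1 - y1) + (x2 - y2) = (x1 + x2) - (y1 + y2)."""
    return normalize(a[0] + b[0], a[1] + b[1])

def negate(a):
    """-(x - y) = y - x, so negation just swaps the coordinates."""
    return (a[1], a[0])

def multiply(a, b):
    """(x1 - y1)(x2 - y2) = (x1*x2 + y1*y2) - (x1*y2 + y1*x2)."""
    return normalize(a[0] * b[0] + a[1] * b[1], a[0] * b[1] + a[1] * b[0])

def to_int(a):
    """Interpret a pair as an ordinary Python integer, for checking results."""
    return a[0] - a[1]

three = (3, 0)        # represents 3
minus_five = (0, 5)   # represents -5
```

Only natural-number addition, multiplication, and the truncated subtraction inside `normalize` are used, so nothing here presupposes that negative numbers already exist.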
We can also represent standard objects of computer science: 


Deterministic finite state machines A deterministic finite state
machine is a tuple (Σ, Q, q₀, δ, Q_accept) where Σ is an alphabet (some
finite set), Q is a state space (another finite set), q₀ ∈ Q is an initial
state, δ : Q × Σ → Q is a transition function specifying which state
to move to when processing some symbol in Σ, and Q_accept ⊆ Q is
the set of accepting states. If we represent symbols and states as
natural numbers, the set of all deterministic finite state machines is
then just a subset of P(ℕ) × P(ℕ) × ℕ × (ℕ^(ℕ×ℕ)) × P(ℕ) satisfying
some consistency constraints.
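As a sketch of how such a tuple might look in practice, the following represents a machine as a plain Python tuple with symbols and states encoded as natural numbers, and the transition function stored as a dictionary keyed by (state, symbol) pairs. The example machine, which accepts strings containing an even number of 1s, and the `run_dfa` helper are both assumptions introduced here.

```python
def run_dfa(machine, word):
    """Feed a sequence of symbols to the machine and report acceptance."""
    alphabet, states, start, delta, accepting = machine
    state = start
    for symbol in word:
        if symbol not in alphabet:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        state = delta[(state, symbol)]
    return state in accepting

alphabet = {0, 1}
states = {0, 1}   # state 0: even number of 1s so far; state 1: odd
# Reading a 1 flips the state and reading a 0 keeps it, which is XOR.
delta = {(q, s): q ^ s for q in states for s in alphabet}

machine = (alphabet, states, 0, delta, {0})
```

The consistency constraints mentioned above correspond to requirements such as the start state belonging to `states` and `delta` having an entry for every (state, symbol) pair.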
3.7 Sizes and arithmetic


We can compute the size of a set by explicitly counting its elements; for
example, |∅| = 0, |{Larry, Moe, Curly}| = 3, and |{x ∈ ℕ | x < 100 ∧ x is prime}| =
|{2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97}| =
25. But sometimes it is easier to compute sizes by doing arithmetic. We
can do this because many operations on sets correspond in a natural way to
arithmetic operations on their sizes. (For much more on this, see Chapter 11.)

Two sets A and B that have no elements in common are said to be
disjoint; in set-theoretic notation, this means A ∩ B = ∅. In this case we have
|A ∪ B| = |A| + |B|. The operation of disjoint union acts like addition for
sets. For example, the disjoint union of the 2-element set {0, 1} and the 3-element
set {Wakko, Jakko, Dot} is the 5-element set {0, 1, Wakko, Jakko, Dot}.

The size of a Cartesian product is obtained by multiplication: |A × B| =
|A| · |B|. An example would be the product of the 2-element set {a, b} with the
3-element set {0, 1, 2}: this gives the 6-element set {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}.
Even though Cartesian product is not generally commutative, swapping each
pair (a, b) to (b, a) is a bijection, so |A × B| = |B × A|.

For power set, it is not hard to show that |P(S)| = 2^|S|. This is a special
case of the size of A^B, the set of all functions from B to A, which is |A|^|B|;
for the power set we can encode P(S) using 2^S, where 2 is the special set
{0, 1}, and a subset T of S is encoded by the function that maps each x ∈ S
to 0 if x ∉ T and 1 if x ∈ T.
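The encoding of subsets by 0/1-valued functions doubles as a recipe for enumerating a power set. In the sketch below, which uses made-up names and a three-element sample set, each tuple of bits is turned into an indicator function on S, and the subset it encodes is read back off.

```python
from itertools import product

def subsets_via_indicators(S):
    """Enumerate P(S) by listing every function from S to {0, 1}."""
    elements = sorted(S)
    result = []
    for bits in product((0, 1), repeat=len(elements)):
        indicator = dict(zip(elements, bits))  # one function S -> {0, 1}
        result.append(frozenset(x for x in elements if indicator[x] == 1))
    return result

S = {"a", "b", "c"}
subsets = subsets_via_indicators(S)
```

Since there are 2 choices of value for each of the |S| elements, the list has 2^|S| entries, and distinct indicator functions give distinct subsets.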


3.7.1 Infinite sets 


For infinite sets, we take the above properties as definitions of addition, 
multiplication, and exponentiation of their sizes. The resulting system is 
known as cardinal arithmetic, and the sizes that sets (finite or infinite) 
might have are known as cardinal numbers. 

The finite cardinal numbers are just the natural numbers: 0, 1, 2, 3, ....
The first infinite cardinal number is the size of the set of natural numbers,
and is written as ℵ₀ (aleph-zero, aleph-null, or aleph-nought). The
next infinite cardinal number is ℵ₁ (aleph-one): it might or might not be
the size of the set of real numbers, depending on whether you include the
Generalized Continuum Hypothesis in your axiom system.¹⁹


¹⁹The generalized continuum hypothesis says (essentially) that there aren’t any more
cardinalities out there in between the ones whose existence can be deduced from the other
axioms of set theory. A consequence of this is that there are no cardinalities between
|ℕ| and |ℝ|. An alternative notation exists if you don’t want to take a position on GCH:
Infinite cardinals can behave very strangely. For example:


• ℵ₀ + ℵ₀ = ℵ₀. In other words, it is possible to have two sets A and B that
both have the same size as ℕ, take their disjoint union, and get another
set A + B that has the same size as ℕ. To give a specific example,
let A = {2x | x ∈ ℕ} (the even numbers) and B = {2x + 1 | x ∈ ℕ}
(the odd numbers). These have |A| = |B| = |ℕ| because there is a
bijection between each of them and ℕ built directly into their definitions.
It’s also not hard to see that A and B are disjoint, and that A ∪ B = ℕ.
So |A| = |B| = |A| + |B| in this case.


The general rule for cardinal addition is that κ + λ = max(κ, λ) if at
least one of κ and λ is infinite. Sums of finite cardinals behave exactly
the way you expect.


• ℵ₀ · ℵ₀ = ℵ₀. Example: A bijection between ℕ × ℕ and ℕ using the
Cantor pairing function ⟨x, y⟩ = (x + y + 1)(x + y)/2 + y. The first few
values of this are ⟨0, 0⟩ = 0, ⟨1, 0⟩ = 2 · 1/2 + 0 = 1, ⟨0, 1⟩ = 2 · 1/2 + 1 =
2, ⟨2, 0⟩ = 3 · 2/2 + 0 = 3, ⟨1, 1⟩ = 3 · 2/2 + 1 = 4, ⟨0, 2⟩ = 3 · 2/2 + 2 = 5,
etc. The basic idea is to order all the pairs by increasing x + y, and then
order pairs with the same value of x + y by increasing y. Eventually
every pair is reached.


The general rule for cardinal multiplication is that κ · λ = max(κ, λ) if
at least one of κ or λ is infinite and neither is zero. So κ · λ = κ + λ
whenever either is infinite and both are nonzero (or both are zero).


• ℕ* = {all finite sequences of elements of ℕ} has size ℵ₀. One way
to do this is to define a function recursively by setting f([]) = 0 and
f([first, rest]) = 1 + ⟨first, f(rest)⟩, where first is the first element of


this writes ℶ₀ (“beth-0”) for |ℕ|, ℶ₁ (“beth-1”) for |ℝ| = |P(ℕ)|, with the general rule
ℶ_{i+1} = 2^{ℶ_i}. This avoids the issue of whether there exist sets with size between ℕ and ℝ,
for example. In my limited experience, only hard-core set theorists ever use ℶ instead of
ℵ: in the rare cases where the distinction matters, most normal mathematicians will just
assume GCH, which makes ℶ_i = ℵ_i for all i.




the sequence and rest is all the other elements. For example,


f(0, 1, 2) = 1 + ⟨0, f(1, 2)⟩
= 1 + ⟨0, 1 + ⟨1, f(2)⟩⟩
= 1 + ⟨0, 1 + ⟨1, 1 + ⟨2, 0⟩⟩⟩
= 1 + ⟨0, 1 + ⟨1, 1 + 3⟩⟩ = 1 + ⟨0, 1 + ⟨1, 4⟩⟩
= 1 + ⟨0, 1 + 19⟩
= 1 + ⟨0, 20⟩
= 1 + 230
= 231.


This assigns a unique element of ℕ to each finite sequence, which is
enough to show |ℕ*| ≤ |ℕ|. With some additional effort one can show
that f is in fact a bijection, giving |ℕ*| = |ℕ|.
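The pairing function and the sequence encoding built from it are both short enough to compute directly. The sketch below uses the hypothetical names `pair` and `encode`, reproduces the values listed above, and checks that the pairs with x + y < 20 receive exactly the codes 0 through 209 with no gaps or repeats, which is the pattern a bijection would require.

```python
def pair(x, y):
    """Cantor pairing function <x, y> = (x + y + 1)(x + y)/2 + y."""
    # The product of two consecutive integers is even, so // is exact here.
    return (x + y + 1) * (x + y) // 2 + y

def encode(seq):
    """f([]) = 0 and f([first, rest]) = 1 + <first, f(rest)>."""
    code = 0
    for item in reversed(seq):   # build up from the innermost f(rest)
        code = 1 + pair(item, code)
    return code

first_values = [pair(x, y)
                for x, y in [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]]

# Codes assigned to every pair on the first 20 diagonals x + y = 0, ..., 19.
codes = sorted(pair(x, y)
               for x in range(20) for y in range(20) if x + y < 20)

example = encode([0, 1, 2])
```

The loop in `encode` unrolls the recursion from the last element backward, which avoids deep recursion on long sequences while computing the same value.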


3.7.2 Countable sets


The sets ℕ, ℕ², and ℕ* all have the property of being countable, which
means that they can be put into a bijection with ℕ or one of its subsets.
Countability of ℕ* means that anything you can write down using finitely
many symbols (even if they are drawn from an infinite but countable alphabet) 
is countable. This has a lot of applications in computer science: one of them 
is that the set of all computer programs in any particular programming 
language is countable. 


3.7.3 Uncountable sets


Exponentiation is different. We can easily show that 2^ℵ₀ ≠ ℵ₀, or equivalently
that there is no bijection between P(ℕ) and ℕ. This is done using Cantor’s
diagonalization argument, which appears in the proof of the following
theorem.


Theorem 3.7.1. Let S be any set. Then there is no surjection f : S → P(S).


Proof. Let f : S → P(S) be some function from S to subsets of S. We’ll
construct a subset of S that f misses, thereby showing that f is not a
surjection. Let A = {x ∈ S | x ∉ f(x)}. Suppose A = f(y). Then y ∈ A ↔
y ∉ A, a contradiction.²⁰


²⁰Exercise: Why does A exist even though the Russell’s Paradox set doesn’t?
Since any bijection is also a surjection, this means that there’s no bijection
between S and P(S) either, implying, for example, that |ℕ| is strictly less
than |P(ℕ)|.
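The diagonal construction in the proof can be watched in action on a small finite set, where every function from S to P(S) can be listed. The sketch below, with illustrative names, enumerates all 512 such functions for a three-element S and confirms that the diagonal set A is never among the values of f. For finite sets this also follows from counting; the value of the theorem is that the same argument works for infinite S.

```python
from itertools import chain, combinations, product

def powerset(S):
    """Return every subset of a finite collection as a frozenset."""
    items = list(S)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

def diagonal_set(S, f):
    """A = {x in S | x not in f(x)}, the set built in the proof."""
    return frozenset(x for x in S if x not in f[x])

S = [0, 1, 2]
subsets = powerset(S)

# Every function f : S -> P(S), each stored as a dictionary.
all_functions = [dict(zip(S, images))
                 for images in product(subsets, repeat=len(S))]

always_missed = all(diagonal_set(S, f) not in f.values()
                    for f in all_functions)
```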

(On the other hand, it is the case that |ℕ^ℕ| = |2^ℕ|, so things are still
weird up here.)

Sets that are larger than ℕ are called uncountable. A quick way to
show that there is no surjection from A to B is to show that A is countable 
but B is uncountable. For example: 


Corollary 3.7.2. There are functions f : ℕ → {0, 1} that are not computed
by any computer program.


Proof. Let P be the set of all computer programs that take a natural number
as input and always produce 0 or 1 as output (assume some fixed language),
and for each program p ∈ P, let f_p be the function that p computes. We’ve
already argued that P is countable (each program is a finite sequence drawn
from a countable alphabet), and since the set of all functions f : ℕ →
{0, 1} = 2^ℕ has the same size as P(ℕ), it’s uncountable. So some f gets
missed: there is at least one function from ℕ to {0, 1} that is not equal to f_p
for any program p.


The fact that there are more functions from ℕ to ℕ than there are
elements of ℕ is one of the reasons why set theory (slogan: “everything is
a set”) beat out lambda calculus (slogan: “everything is a function from 
functions to functions”) in the battle over the foundations of mathematics. 
And this is why we do set theory in CPSC 202 and lambda calculus (disguised 
as Scheme) in CPSC 201. 


3.8 Further reading 


See [Ros12, §§2.1–2.2], [Big02, Chapter 2], or [Fer08, §1.3, §1.5].


Chapter 4 


The real numbers 


The real numbers ℝ are the subject of high-school algebra and most
practical mathematics. Some important restricted classes of real numbers
are the naturals ℕ = 0, 1, 2, ..., the integers ℤ = ..., −2, −1, 0, 1, 2, ...,
and the rationals ℚ, which consist of all real numbers that can be written
as ratios of integers p/q, otherwise known as fractions.

The rationals include 1, 3/2, 22/7, −355/113, and so on, but not some
common mathematical constants like e ≈ 2.718281828... or π ≈ 3.141592....
Real numbers that are not rational are called irrational. There is no
single-letter abbreviation for the irrationals.

The typeface used for N, Z, Q, and R is called blackboard bold and 
originates from the practice of emphasizing a letter on a blackboard by 
writing it twice. Some writers just use ordinary boldface: N, etc., but this 
does not scream out “this is a set of numbers” as loudly as blackboard bold. 
You may also see blackboard bold used for the complex numbers C, which 
are popular in physics and engineering, and for some more exotic number 
systems like the quaternions H,¹ which are sometimes used in graphics, or
the octonions O, which exist mostly to see how far complex numbers can 
be generalized. 

Like any mathematical structure, the real numbers are characterized by 
a list of axioms, which are the basic facts from which we derive everything 
we know about the reals. There are many equivalent ways of axiomatizing 
the real numbers; we will give one here. Many of these properties can also 
be found in [Fer08, Appendix B]. These should mostly be familiar to you 
from high-school algebra, but we include them here because we need to know
what we can assume when we want to prove something about reals, and also
because it lets us sneak in definitions of various algebraic structures like
groups and fields that will turn out to be useful later.


¹Why H? The rationals already took Q (for “quotient”), so the quaternions are
abbreviated by the initial of their discoverer, William Rowan Hamilton.


4.1 Field axioms 


The real numbers are a field, which means that they support the operations 
of addition +, multiplication ·, and their inverse operations subtraction −
and division /. The behavior of these operations is characterized by the field 
axioms. 


4.1.1 Axioms for addition 


Addition in a field satisfies the axioms of a commutative group (often 
called an abelian group, after Niels Henrik Abel, an early nineteenth- 
century mathematician). These characterize the behavior of the addition 
operation + and sums of the form a+ b (“a plus b”). 


Axiom 4.1.1 (Commutativity of addition). For all numbers, 
a+b=b+a. (4.1.1) 


Any operation that satisfies Axiom 4.1.1 is called commutative. Com- 
mutativity lets us ignore the order of arguments to an operation. Later, we 
will see that multiplication is also commutative. 


Axiom 4.1.2 (Associativity of addition). For all numbers, 
a+(b+c)=(a+b)+c. (4.1.2) 


An operation that satisfies Axiom 4.1.2 is called associative. Asso- 
ciativity means we don’t have to care about how a sequence of the same 
associative operation is parenthesized, letting us write just a + b + c for
a + (b + c) = (a + b) + c.²


²A curious but important practical fact is that addition is often not associative in
computer arithmetic. This is because computers (and calculators) approximate real
numbers by floating-point numbers, which only represent some limited number
of digits of an actual real number in order to make it fit in limited memory. This
means that low-order digits on very large numbers can be lost to round-off error. So
a computer might report (1000000000000 + −1000000000000) + 0.00001 = 0.00001 but
1000000000000 + (−1000000000000 + 0.00001) = 0.0. Since we don’t have to write any
programs in this class, we will just work with actual real numbers, and not worry about
such petty numerical issues.
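
The round-off effect described in this footnote can be observed directly. The following is a minimal Python sketch, assuming the interpreter uses IEEE 754 double-precision floats (as standard CPython builds do); the variable names are illustrative only. It also repeats the computation with exact rational arithmetic, where associativity holds.

```python
from fractions import Fraction

big = 1000000000000.0
tiny = 0.00001

# Parenthesization 1: the two large terms cancel first, so tiny survives.
left = (big + -big) + tiny
# Parenthesization 2: tiny is lost to round-off when added to -big.
right = big + (-big + tiny)

print(left, right)
print(left == right)  # the two groupings disagree in floating point

# With exact rationals, Axiom 4.1.2 holds and both groupings agree.
b, t = Fraction(10**12), Fraction(1, 100000)
exact_left = (b + -b) + t
exact_right = b + (-b + t)
print(exact_left == exact_right)
```

The floating-point failure depends on the sizes of the operands: with a much larger value in place of 0.00001, both groupings would agree.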
Axiom 4.1.3 (Additive identity). There exists a number 0 such that, for
all numbers a, 
a+0=0+a=a. (4.1.3) 


An object that satisfies the condition a+0 = 0+a = a for some operation 
is called an identity for that operation. Later we will see that 1 is an identity 
for multiplication. 

It’s not hard to show that identities are unique: 


Lemma 4.1.4. Let 0' + a = a + 0' = a for all a. Then 0' = 0.


Proof. Compute 0' = 0' + 0 = 0. (The first equality holds by the fact that
a = a + 0 for all a and the second from the assumption that 0' + a = a for
all a.)


Axiom 4.1.5 (Additive inverses). For each a, there exists a number −a,
such that
a + (−a) = (−a) + a = 0. (4.1.4)


For convenience, we will often write a + (−b) as a − b (“a minus b”).
This gives us the operation of subtraction. The operation that returns −a
given a is called negation and −a can be read as “negative a,” “minus
a”,³ or the “negation of a.”

Like identities, inverses are also unique:


Lemma 4.1.6. If a' + a = a + a' = 0, then a' = −a.


Proof. Starting with 0 = a' + a, add −a on the right to both sides to get
−a = a' + a + (−a) = a'.


4.1.2 Axioms for multiplication 


Multiplication in a field satisfies the axioms of a commutative group, if the 
additive identity 0 is excluded. 

For convenience,⁴ the multiplication operation · is often omitted, allowing
us to write ab for a · b. We will use this convention when it will not cause
confusion.


³Warning: Some people will get annoyed with you over “minus a” and insist on reserving
“minus” for the operation in a − b. In extreme cases, you may see −a typeset differently:
-a. Pay no attention to these people. Though not making the distinction makes life more
difficult for calculator designers and compiler writers, as a working mathematician you are
entitled to abuse notation by using the same symbol for multiple purposes when it will
not lead to confusion.

⁴Also called “laziness.”
Axiom 4.1.7 (Commutativity of multiplication). For all numbers,
ab = ba. (4.1.5)


Axiom 4.1.8 (Associativity of multiplication). For all numbers,
a(bc) = (ab)c. (4.1.6)


Axiom 4.1.9 (Multiplicative identity). There exists a number 1 ≠ 0 such
that, for all numbers a,
a · 1 = 1 · a = a. (4.1.7)


We insist that 1 ≠ 0 because we want Axiom 4.1.9 to hold in R \ {0}.
This also has the beneficial effect of preventing us from having R = {0}, 
which would otherwise satisfy all of our axioms. 

Since the only difference between the multiplicative identity and the 
additive identity is notation, Lemma 4.1.4 applies here as well: if there is 
any 1' such that a · 1' = 1' · a = a for all a, then 1' = 1.


Axiom 4.1.10 (Multiplicative inverses). For every a except 0, there exists
a number a^{-1}, such that
a · a^{-1} = a^{-1} · a = 1. (4.1.8)


Lemma 4.1.6 applies here to show that a^{-1} is also unique for each a.

For convenience, we will often write a · b^{-1} as a/b or the vertical version
\frac{a}{b}. This gives us the operation of division. The expression a/b or \frac{a}{b} is
pronounced “a over b” or (especially in elementary school, whose occupants
are generally not as lazy as full-grown mathematicians) “a divided by b.”
Some other notations for this operation are a ÷ b and a : b. These are also
mostly used in elementary school.⁵

Note that because 0 is not guaranteed to have an inverse,⁶ the meaning
of a/0 is not defined.

The number a^{-1}, when it does exist, is often just called the inverse of a
or sometimes “inverse a.” (The ambiguity that might otherwise arise with the
additive inverse −a is avoided by using negation for −a.) The multiplicative
inverse a^{-1} can also be written using the division operation as 1/a.


⁵Using a colon for division is particularly popular in German-speaking countries, where
the “My Dear Aunt Sally” rule for remembering that multiplication and division bind tighter
than addition and subtraction becomes the more direct Punktrechnung vor Strichrechnung
(“point reckoning before stroke reckoning”).

⁶In fact, once we get a few more axioms, terrible things will happen if we try to make 0
have an inverse.
4.1.3 Axioms relating multiplication and addition


Axiom 4.1.11 (Distributive law). For all a, b, and c,
a · (b + c) = ab + ac (4.1.9)
(a + b) · c = ac + bc (4.1.10)


Since multiplication is commutative, we technically only need one of
(4.1.9) and (4.1.10), but there are other structures we will see called rings
that satisfy the distributive law without having a commutative multiplication
operation, so it’s safest to include both.


The additive identity 0 also has a special role in multiplication, which is 
a consequence of the distributive law: it’s an annihilator: 


Lemma 4.1.12. For all a,
a · 0 = 0 · a = 0. (4.1.11)


Proof. Because 0 = 0 + 0, we have a · 0 = a · (0 + 0) = a · 0 + a · 0. But then
adding −(a · 0) to both sides gives 0 = a · 0.


Annihilation is why we don’t want to define 0^{-1}, and thus won’t allow
division by zero. If there were a real number that was 0^{-1}, then for any a
and b we would have:


a · 0 = b · 0 = 0
(a · 0) · 0^{-1} = (b · 0) · 0^{-1}
a · (0 · 0^{-1}) = b · (0 · 0^{-1})
a · 1 = b · 1
a = b.
(Exercise: which axiom is used at each step in this proof?)


In particular, we would get 1 = 0, contradicting Axiom 4.1.9. 
A similar argument shows that 


Lemma 4.1.13. If a · b = 0, then a = 0 or b = 0.


Proof. Suppose a · b = 0 but a ≠ 0.⁷ Then a has an inverse a^{-1}. So we can
compute


a · b = 0 (4.1.12)
a^{-1} · a · b = a^{-1} · 0 (4.1.13)
b = 0. (4.1.14)


⁷This is an example of the proof strategy where we show P ∨ Q by assuming ¬P and
proving Q.
Another consequence of the distributive law is that we can determine
how multiplication interacts with negation. You may recall being taught at 
an impressionable age that 


a · (−b) = −(ab), (4.1.15)
(−a) · b = −(ab), (4.1.16)

and
(−a) · (−b) = ab. (4.1.17)


Like annihilation, these are not axioms—or at least, we don’t have to include 
them as axioms if we don’t want to. Instead, we can prove them directly 
from axioms and theorems we’ve already got. For example, here is a proof 
of (4.1.15): 


a · 0 = 0
a · (b + (−b)) = 0
ab + a · (−b) = 0
−(ab) + (ab + a · (−b)) = −(ab)
(−(ab) + ab) + a · (−b) = −(ab)
0 + a · (−b) = −(ab)
a · (−b) = −(ab).


Similar proofs can be given for (4.1.16) and (4.1.17). 
A special case of this is that multiplying by —1 is equivalent to negation: 


Corollary 4.1.14. For all a,


(−1) · a = −a. (4.1.18)


Proof. Using (4.1.16), (−1) · a = −(1 · a) = −a.


4.1.4 Other algebras satisfying the field axioms 


The field axioms so far do not determine the real numbers. They also hold for 
any number of other fields, including the rationals Q, the complex numbers 
C, and various finite fields such as the integers modulo a prime p (written as 
Z_p; we'll see more about these in Chapter 14).

They do not hold for the integers Z (which don’t have multiplicative
inverses) or the natural numbers N (which don’t have additive inverses either). 
This means that Z and N are not fields, although they are examples of weaker 
algebraic structures (a ring in the case of Z and a semiring in the case of 
N). 
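
The role of multiplicative inverses in separating fields from weaker structures can be checked by brute force for small moduli. The sketch below is illustrative only: the helper names and the choice of moduli 7 (prime) and 6 (not prime) are assumptions made for the example, not part of the text.

```python
def inverse_table(m):
    """Map each nonzero residue a mod m that has an inverse to that inverse."""
    return {
        a: b
        for a in range(1, m)
        for b in range(1, m)
        if (a * b) % m == 1
    }


def has_all_inverses(m):
    """True if every nonzero residue mod m has a multiplicative inverse."""
    return len(inverse_table(m)) == m - 1


print(inverse_table(7))     # every nonzero residue mod 7 appears as a key
print(inverse_table(6))     # only some residues mod 6 are invertible
print(has_all_inverses(7), has_all_inverses(6))
```

For the prime modulus every nonzero element is invertible, as Axiom 4.1.10 requires; for the composite modulus elements such as 2 and 3 have no inverse, so the integers modulo 6 fail to be a field for the same reason Z does.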


In order to get the reals, we will need a few more axioms. 


4.2 Order axioms 


Unlike C and Z_p (but like Q), the real numbers are an ordered field,
meaning that in addition to satisfying the field axioms, there is a relation ≤
that satisfies the axioms:


Axiom 4.2.1 (Comparability). a ≤ b or b ≤ a.

Axiom 4.2.2 (Antisymmetry). If a ≤ b and b ≤ a, then a = b.

Axiom 4.2.3 (Transitivity). If a ≤ b and b ≤ c, then a ≤ c.

Axiom 4.2.4 (Translation invariance). If a ≤ b, then a + c ≤ b + c.

Axiom 4.2.5 (Scaling invariance). If a ≤ b and 0 ≤ c, then a · c ≤ b · c.


The first three of these mean that ≤ is a total order (see §9.5.5). The
other axioms describe how ≤ interacts with addition and multiplication.

For convenience, we define a < b as shorthand for a ≤ b and a ≠ b, and
define reverse operations a ≥ b (meaning b ≤ a) and a > b (meaning b < a).
If a > 0, we say that a is positive. If a < 0, it is negative. If a ≥ 0, it is
non-negative. Non-positive can be used to say a ≤ 0, but this doesn’t
seem to come up as much as non-negative.

Other properties of ≤ can be derived from these axioms.


Lemma 4.2.6 (Reflexivity). For all x, x ≤ x.


Proof. Apply comparability with y = x.


Lemma 4.2.7 (Trichotomy). Exactly one of x < y, x = y, or x > y holds.


Proof. First, let’s show that at least one holds. If x = y, we are done.
Otherwise, suppose x ≠ y. From comparability, we have x ≤ y or y ≤ x.
Since x ≠ y, this gives either x < y or x > y.

Next, observe that x = y implies x ≮ y and x ≯ y, since x < y and x > y
are both defined to hold only when x ≠ y. This leaves the possibility that
x < y and x > y. But then x ≤ y and y ≤ x, so by anti-symmetry, x = y,
contradicting our assumption. So at most one holds.
Trichotomy lets us treat, for example, x ≮ y and x ≥ y as equivalent.


Lemma 4.2.8. If a ≥ 0, then −a ≤ 0.


Proof. Take a ≥ 0 and add −a to both sides (using Axiom 4.2.4) to get
0 ≥ −a.


Lemma 4.2.9. For all a and b, a ≥ b if and only if a − b ≥ 0.


Proof. Given a ≥ b, add −b to both sides to get a − b ≥ 0. Given a − b ≥ 0,
do the reverse by adding b to both sides.


Lemma 4.2.10. If a ≥ 0 and b ≥ 0, then a + b ≥ 0.


Proof. From Lemma 4.2.8, a ≥ 0 implies 0 ≥ −a. So b ≥ 0 ≥ −a by
transitivity. Add a to both sides to get a + b ≥ 0.


Theorem 4.2.11. If a ≥ b and c ≥ d, then a + c ≥ b + d.


Proof. From Lemma 4.2.9, a − b ≥ 0 and c − d ≥ 0. From Lemma 4.2.10, we
get (a − b) + (c − d) ≥ 0. Now add b + d to both sides to get a + c ≥ b + d.


Lemma 4.2.12. If a ≤ b, then −b ≤ −a.


Proof. Subtract a + b from both sides.


Theorem 4.2.13. If a ≤ b and c ≤ 0, then a · c ≥ b · c.


Proof. From Lemma 4.2.12, −c ≥ 0, so from Axiom 4.2.5, −c · a ≤ −c · b. Now
apply Lemma 4.2.12 to get c · a ≥ c · b.


4.3 Least upper bounds 


One more axiom is needed to characterize the reals. A subset S of R has
an upper bound if there is some x ∈ R such that y ≤ x for all y in S. It
has a least upper bound if there is a smallest z with this property: some
z such that (a) z is an upper bound on S and (b) whenever q is an upper
bound on S, z ≤ q.


Axiom 4.3.1 (Least upper bound property). Every nonempty subset of R 
that has an upper bound has a least upper bound. 
More formally, if for some S ⊆ R, S ≠ ∅ and there exists some x such
that y ≤ x for all y in S, then there exists z ∈ R such that y ≤ z for all y in
S and whenever y ≤ q for all y in S, z ≤ q.

The least upper bound of a set S, if there is one, is called the supremum 
of S and written as sup S. A consequence of the least upper bound property 
is that every nonempty set S that has a lower bound has a greatest lower 
bound, or infimum: inf S = −sup {−x | x ∈ S}. Neither the supremum
nor the infimum is defined for empty or unbounded sets.⁸

It may be that sup S or inf S is not actually an element of S. For example, 
sup {x ∈ R | x < 1} = 1, but 1 ∉ {x ∈ R | x < 1}.

Having least upper bounds distinguishes the reals from the rationals: 
The bounded nonempty set {x ∈ Q | x · x ≤ 2} has no least upper bound in
Q (because √2 is not rational), but it does in R. (This is another example
of a set that doesn’t include its least upper bound.) 

A consequence of having least upper bounds is that reals do not get too 
big or too small: 


Theorem 4.3.2 (Archimedean property). For any two real numbers 0 <
x < y, there exists some n ∈ N such that n · x > y.


Proof. The proof is by contradiction.

Suppose that this is not true, that is, that there exist 0 < x < y such
that n · x ≤ y for all n ∈ N. Dividing both sides by x gives n ≤ y/x for all
n ∈ N, meaning that y/x is an upper bound on N. From the least upper
bound property, there exists a least upper bound z on N.

Now consider z − 1. This is less than z, so it’s not an upper bound on
N. If we negate the statement ∀n ∈ N : n ≤ z − 1, we get ∃n ∈ N : n > z − 1.
But then n + 1 > z, contradicting the claim that z is an upper bound.
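
For concrete values, the n promised by the Archimedean property can be computed rather than merely shown to exist. The sketch below assumes rational inputs represented with Python's fractions module so that the arithmetic is exact; the function name archimedean_witness and the sample values are hypothetical choices made for this illustration.

```python
from fractions import Fraction
from math import floor


def archimedean_witness(x, y):
    """Return the smallest natural number n with n * x > y, given 0 < x < y."""
    if not (0 < x < y):
        raise ValueError("expected 0 < x < y")
    # n * x > y is the same as n > y / x, and the smallest integer
    # strictly greater than y / x is floor(y / x) + 1.
    return floor(y / x) + 1


x = Fraction(1, 1000)
y = Fraction(22, 7)
n = archimedean_witness(x, y)
print(n, n * x > y, (n - 1) * x > y)
```

Floating-point inputs would also run, but round-off in y / x could then give a value of n that is off by one near exact multiples.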


This excludes the possibility of infinitesimals, nonzero values that are 
nonetheless smaller than every positive rational number. You can blame 
Bishop Berkeley [Ber34] for the absence of these otherwise very useful objects 
from our standard mathematical armory. However, in return for losing 
infinitesimals we do get that the rationals are dense in the reals, meaning 
that between any two reals is a rational, as well as many other useful 
properties. 


⁸It’s sometimes convenient to extend R by adding two extra elements −∞ and +∞,
where −∞ is smaller than all reals and +∞ is bigger. In the resulting extended real line,
we can define inf S = −∞ when S has no lower bound, sup S = +∞ when S has no upper
bound, and inf ∅ = +∞ and sup ∅ = −∞. These last two conventions are chosen because
they preserve the rules inf(S ∪ T) = min(inf S, inf T) and sup(S ∪ T) = max(sup S, sup T).
4.4 What’s missing: algebraic closure


One way to think about the development of number systems is that each 
system N, Z, Q, R, and C adds the ability to solve equations that have no 
solutions in the previous system. Some specific examples are 


x + 1 = 0        Solvable in Z but not N
2x = 1           Solvable in Q but not Z
x · x = 2        Solvable in R but not Q
x · x + 1 = 0    Solvable in C but not R


This process stops with the complex numbers C, which consist of pairs of 
the form a + bi where i^2 = −1. The reason is that the complex numbers are
algebraically closed: if you write an equation using only complex numbers,
+, and ·, then it has at least one solution in C. What we give up in moving
from R to C is that we lose order: there is no ordering of complex numbers 
that satisfies the translation and scaling invariance axioms. As in many other 
areas of mathematics and computer science, we are forced to make trade-offs 
based on what is important to us at the time. 


4.5 Arithmetic 


In principle, it is possible to show that the standard grade-school algorithms 
for arithmetic all work in R as defined by the axioms in the preceding sections. 
This is sometimes trickier than it looks: for example, just showing that 1 is
positive requires a sneaky application of Axiom 4.2.5.⁹

To avoid going nuts, we will adopt the following rule: 


Rule 4.5.1. Any grade-school fact about arithmetic that does not involve 
any variables will be assumed to be true in R. 


So for example, you don’t need to write out a proof using the definition of
multiplicative inverses and the distributive law to conclude that 1/2 + 3/5 = 11/10;
just remembering how to add fractions (or getting a smart enough computer
to do it for you) is enough.

Caveat: Dumb computers will insist on returning useless decimals like
1.1. As mathematicians, we don’t like decimal notation, because it can’t
represent exactly even trivial values like 1/3. Similarly, mixed fractions like
1 1/10, while useful for carpenters, are not popular in mathematics.
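
A "smart enough computer" in this sense is one that does exact rational arithmetic. As a small sketch, Python's standard fractions module behaves this way; the particular sample values below are chosen only for illustration.

```python
from fractions import Fraction

# Exact addition of fractions, as in grade school.
total = Fraction(1, 2) + Fraction(3, 5)
print(total)          # an exact rational
print(float(total))   # the decimal a "dumb" computer would report

# One third is represented exactly as a fraction.
third = Fraction(1, 3)
thirds_sum_exact = (third + third + third == 1)

# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly.
decimals_sum_exact = (0.1 + 0.2 == 0.3)

print(thirds_sum_exact, decimals_sum_exact)
```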


⁹Suppose 1 ≤ 0. Then 1 · 1 ≥ 0 · 1 (Theorem 4.2.13), which simplifies to 1 ≥ 0. By
antisymmetry this gives 1 = 0. Since 1 ≠ 0, this contradicts our assumption, showing that 1 > 0.




4.6 Connection between the reals and other stan- 
dard algebras 


The reals are an example of an algebra, which is a set with various operations 
attached to it: the set is R itself with the operations being 0, 1, +, and ·. A
sub-algebra is a subset that is closed under the operations, meaning that
the result of any operation applied to elements of the subset (no elements
in the case of 0 or 1) is an element of the subset.

All sub-algebras of R inherit any properties that don’t depend on the 
existence of particular elements other than 0 and 1; so addition and multipli- 
cation are still commutative and associative, multiplication still distributes 
over addition, and 0 and 1 are still identities. But other axioms may fail. 

Some interesting sub-algebras of R are: 


• The natural numbers N. This is the smallest sub-algebra of R,
because once you have 0, 1, and addition, you can construct the rest
of the naturals as 1 + 1, 1 + 1 + 1, etc.¹⁰ They do not have additive or
multiplicative inverses, but they do satisfy the order axioms, as well as
the extra axiom that 0 ≤ x for all x ∈ N.


• The integers Z. These are what you get if you throw in additive
inverses: now in addition to 0, 1, 1 + 1, etc., you also get −1, −(1 + 1),
etc.¹¹ The order axioms are still satisfied. No multiplicative inverses,
though.


• The dyadics D. These are numbers of the form m · 2^{-n} where m ∈ Z
and n ∈ N. These are of some importance in computing because almost
all numbers represented inside a computer are really dyadics, although
in mathematics they are not used much. Like the integers, they still
don’t have multiplicative inverses: there is no way to write 1/3 (for
example) as m · 2^{-n}.


• The rationals Q. Now we ask for multiplicative inverses, and get them.
Any rational can be written as p/q where p and q are integers. Unless
extra restrictions are put on p and q, these representations are not
unique: 22/7 = 44/14 = 66/21 = (−110)/(−35). You probably first
saw these in grade school as fractions, and one way to describe Q is
as the field of fractions of Z.

The rationals satisfy all the field axioms, and are the smallest sub-field
of R. They also satisfy all the ordered field axioms and the Archimedean
property. But they are not complete. Adding completeness gives the
real numbers.


¹⁰Formally, we can define N as the smallest subset of R that contains 0 and 1, and is
closed under addition. This definition works because given any two subsets S and T that have
these properties, so does their intersection. So we can let N be the intersection of all
subsets of R that contain 0 and 1 and are closed under addition.

It happens to be the case that with this definition, the naturals are also closed under
multiplication. Proving this is a bit of a nuisance, since the obvious way to do this requires
an induction argument, which we will get to in Chapter 5.

¹¹Like the naturals, these can be defined as the smallest subset of R containing 0 and 1
that is closed under addition and additive inverse.


An issue that arises here is that, strictly speaking, the natural numbers 
N we defined back in §3.4 are not elements of R as defined in terms of, 
say, Dedekind cuts. The former are finite ordinals while the latter are 
downward-closed sets of rationals, themselves represented as elements of 
N x N. Similarly, the integer elements of Q will be pairs of the form (n, 1) 
where n € N rather than elements of N itself. We also have a definition (§I.1) 
that builds natural numbers out of 0 and a successor operation S. So what 
does it mean to say N ⊆ Q ⊆ R?

One way to think about it is that the sets 


{∅, {∅}, {∅, {∅}}, {∅, {∅}, {∅, {∅}}}, ...},
{(0,1), (1,1), (2,1), (3,1), ...},
{{(p,q) | p < 0}, {(p,q) | p < q}, {(p,q) | p < 2q}, {(p,q) | p < 3q}, ...},


and 
{0, SO, SS0, SSS0, ...} 


are all isomorphic: there are bijections between them that preserve the 
behavior of 0, 1, +, and -. So we think of N as representing some Platonic 
ideal of natural-numberness that is only defined up to isomorphism.¹² So in
the context of R, when we write N, we mean the version of N that is a subset 
of R, and in other contexts, we might mean a different set that happens to 
behave in exactly the same way. 

In the other direction, the complex numbers are a super-algebra of the 
reals: we can think of any real number x as the complex number x + 0i,
and this complex number will behave exactly the same as the original real 
number x when interacting with other real numbers carried over into C in 
the same way. 

The various features of these algebras are summarized in Table 4.1. 


¹²In programming terms, N is an interface that may have multiple equivalent
implementations.
Symbol                N         Z         Q          R      C
Name                  Naturals  Integers  Rationals  Reals  Complex numbers
Typical element       12        −12       12/7       √12    √12 + (22/7)i
Associative           Yes       Yes       Yes        Yes    Yes
0 and 1               Yes       Yes       Yes        Yes    Yes
Inverses              No        + only    Yes        Yes    Yes
Ordered               Yes       Yes       Yes        Yes    No
Least upper bounds    Yes       Yes       No         Yes    No
Algebraically closed  No        No        No         No     Yes


Table 4.1: Features of various standard algebras


4.7 Extracting information from reals 


The floor function ⌊x⌋ and ceiling function ⌈x⌉ can be used to convert an
arbitrary real to an integer: the floor of x is the largest integer less than
or equal to x, while the ceiling of x is the smallest integer greater than or
equal to x. More formally, they are defined by ⌊x⌋ = sup {y ∈ Z | y ≤ x} and
⌈x⌉ = inf {y ∈ Z | y ≥ x}. The floor and ceiling will always be integers, with
x − 1 < ⌊x⌋ ≤ x ≤ ⌈x⌉ < x + 1. If x is already an integer, ⌊x⌋ = x = ⌈x⌉. Some
examples: ⌊π⌋ = 3, ⌈π⌉ = 4, ⌊−1/2⌋ = −1, ⌈−1/2⌉ = 0, ⌊12⌋ = ⌈12⌉ = 12.

If you want the fractional part of a real number x, you can compute it
as x − ⌊x⌋.

The absolute value |x| of x is defined by 


|x| = −x  if x < 0,
|x| = x   if x ≥ 0.


The absolute value function erases the sign of x: |−12| = |12| = 12.
The signum function sgn(x) returns the sign of its argument, encoded
as −1 for negative, 0 for zero, and +1 for positive:


sgn(x) = −1  if x < 0,
sgn(x) = 0   if x = 0,
sgn(x) = +1  if x > 0.


So sgn(−12) = −1, sgn(0) = 0, and sgn(12) = 1. This allows for an
alternative definition of |x| as sgn(x) · x.
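
These functions are easy to try out. The sketch below uses math.floor and math.ceil from Python's standard library for floor and ceiling; the helpers frac, sgn, and abs_via_sgn are hypothetical names written here to mirror the definitions in this section.

```python
import math


def frac(x):
    """Fractional part of x, computed as x minus its floor."""
    return x - math.floor(x)


def sgn(x):
    """Signum: -1 for negative, 0 for zero, +1 for positive."""
    if x < 0:
        return -1
    if x > 0:
        return 1
    return 0


def abs_via_sgn(x):
    """Absolute value using the alternative definition sgn(x) * x."""
    return sgn(x) * x


print(math.floor(math.pi), math.ceil(math.pi))
print(math.floor(-0.5), math.ceil(-0.5))
print(frac(2.75), frac(-0.5))
print(sgn(-12), sgn(0), sgn(12), abs_via_sgn(-12))
```

Note that the fractional part of a negative number such as −1/2 is 1/2, not −1/2, because the floor rounds toward −∞ rather than toward zero.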


Chapter 5 


Induction and recursion 


Induction is a technique for proving universal statements about some class 
of objects built from smaller objects: the idea is to show that if each object 
has a property provided all the smaller objects do, then every object in 
the class has the property. Recursion is the same technique applied to 
definitions instead of proofs: an object is defined in terms of smaller objects 
of the same type. 


5.1 Simple induction 


The simplest form of induction goes by the name of simple induction, and 
it’s what we use to show that something is true for all natural numbers. 

We have several equivalent definitions of the natural numbers N, but 
what they have in common is the following basic pattern, which goes back 
to Peano [Pea89]: 


• 0 is a natural number.
• If x is a natural number, so is x + 1.


This is an example of a recursive definition: it gives us a base object 
to start with (0) and defines new natural numbers (x + 1) by applying some
operation (+1) to natural numbers we already have (x).

Because these are the only ways to generate natural numbers, we can 
prove that a particular natural number has some property P by showing 
that you can’t construct a natural number without having P be true. This 
means showing that P(0) is true, and that P(x) implies P(x +1). If both of 
these statements hold, then P is baked into each natural number as part of 
its construction. 
We can express this formally as the induction schema:

(P(0) ∧ ∀x ∈ N : (P(x) → P(x + 1))) → ∀x ∈ N : P(x). (5.1.1)


Any proof that uses the induction schema will consist of two parts, the 
base case showing that P(0) holds, and the induction step showing that 
P(x) → P(x + 1). The assumption P(x) used in the induction step is called
the induction hypothesis. 

For example, let’s suppose we want to show that for all n ∈ N, either
n = 0 or there exists n' such that n = n' + 1. Proof: We are trying to show
that P(x) holds for all x, where P(x) says x = 0 ∨ (∃x' : x = x' + 1). The
base case is when x = 0, and here P(0) holds by the addition rule, since
its first disjunct 0 = 0 is true. For the induction step, we are given that P(x) holds, and
want to show that P(x + 1) holds. In this case, we can do this easily by
observing that P(x + 1) expands to (x + 1) = 0 ∨ (∃x' : x + 1 = x' + 1). So
let x' = x and we are done.¹


Here’s a less trivial example. So far we have not defined exponentiation 
for natural numbers. Let’s solve this by declaring 


x^0 = 1 (5.1.2)
x^{n+1} = x · x^n (5.1.3)
where n ranges over all elements of N.

This is a recursive definition: to compute, say, 2^4, we expand it out
using (5.1.3) until we bottom out at (5.1.2). This gives 2^4 = 2 · 2^3 = 2 · 2 · 2^2 =
2 · 2 · 2 · 2^1 = 2 · 2 · 2 · 2 · 2^0 = 2 · 2 · 2 · 2 · 1 = 16.
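
The recursive definition translates almost line for line into a recursive program. The following is a minimal sketch, with the function name power chosen here for illustration; it assumes n is a natural number and makes no attempt at efficiency.

```python
def power(x, n):
    """Compute x to the power n for a natural number n, following (5.1.2) and (5.1.3)."""
    if n < 0:
        raise ValueError("n must be a natural number")
    if n == 0:
        return 1                    # base rule (5.1.2)
    return x * power(x, n - 1)      # recursive rule (5.1.3)


print(power(2, 4))
```

Each recursive call reduces n by one, so the computation bottoms out at the base rule after exactly n steps, just as in the hand expansion of 2^4.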

If we want to prove something about our newly-defined operation, we are 
likely to end up using induction. 


Theorem 5.1.1. If a > 1, then a^n > 1 for all n > 0.


Proof. Let a> 1. 

Since we are looking at a universal statement about almost all naturals, 
we’re going to prove it by induction. This requires choosing an induction 
hypothesis. We can rewrite the claim slightly as for all n, n > 0 implies
a^n > 1.

Base case: If n = 0, then n ≯ 0, so the induction hypothesis holds
vacuously.


¹This is admittedly not a very interesting use of induction, since we don’t actually use
P(x) in proving P(x +1). 
Induction step: Suppose the induction hypothesis holds for n, i.e., that
n > 0 → a^n > 1. We want to show that it also holds for n + 1. Annoyingly,
there are two cases we have to consider:


1. n = 0. Then we can compute a^1 = a · a^0 = a · 1 = a > 1.


2. n > 0. The induction hypothesis now gives a^n > 1 (since in this case
the premise n > 0 holds), so a^{n+1} = a · a^n > a · 1 > 1.


5.2 Alternative base cases 


One of the things that is apparent from the proof of Theorem 5.1.1 is that 
being forced to start at 0 may require painful circumlocutions if 0 is not the 
first natural for which the predicate we care about holds. So in practice
it is common to use a different base case. This gives a generalized version of 
the induction schema that works for any integer base: 


(P(z_0) ∧ ∀z ∈ Z, z ≥ z_0 : (P(z) → P(z + 1))) → ∀z ∈ Z, z ≥ z_0 : P(z). (5.2.1)

Intuitively, this works for the same reason (5.1.1) works: if P is true for
z_0, then any larger integer can be reached by applying +1 enough times,
and each +1 operation preserves P. If we want to prove it formally, observe
that (5.2.1) turns into (5.1.1) if we do a change of variables and define
Q(n) = P(n + z_0), so that n = z − z_0 ranges over N.

Here’s an example of starting at a non-zero base case: 

Theorem 5.2.1. Let n ∈ N. If n ≥ 4, then 2^n ≥ n^2.


Proof. Base case: Let n = 4, then 2^n = 16 = n^2.

For the induction step, assume 2^n ≥ n^2. We need to show that 2^{n+1} ≥
(n + 1)^2 = n^2 + 2n + 1. Using the assumption and the fact that n ≥ 4, we
can compute

2^{n+1} = 2 · 2^n
        ≥ 2n^2
        = n^2 + n^2
        ≥ n^2 + 4n
        = n^2 + 2n + 2n
        ≥ n^2 + 2n + 1
        = (n + 1)^2.
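
A quick numerical check is no substitute for the proof, but it can confirm that the statement and its base case are set up sensibly. The sketch below, with a hypothetical helper named holds, tests the inequality on a finite range and looks for failures below the base case.

```python
def holds(n):
    """The predicate of Theorem 5.2.1: 2^n >= n^2."""
    return 2 ** n >= n ** 2


# Values below the base case where the inequality fails.
failures_below_base = [n for n in range(0, 4) if not holds(n)]

# The inequality on a finite range starting at the base case.
holds_from_base = all(holds(n) for n in range(4, 1000))

# The key step of the induction: n^2 >= 4n whenever n >= 4.
step_fact = all(n * n >= 4 * n for n in range(4, 1000))

print(failures_below_base, holds_from_base, step_fact)
```

The inequality happens to hold for n = 0, 1, 2 as well, but it fails at n = 3, which is why an induction starting from 0 cannot succeed and the base case is moved to 4.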
5.3 Recursive definitions work


In §5.1, we defined x^n recursively, by giving rules for computing x^0 and
computing x^{n+1} given x^n. We can show using induction that such definitions
actually work.


Lemma 5.3.1. Let S be some codomain, let x_0 ∈ S, let g : S → S, and let f : N → S
satisfy


f(0) = x_0
f(n + 1) = g(f(n))


Then there is a unique function f with this property.


Proof. Suppose that there is some f' such that f'(0) = x_0 and f'(n + 1) =
g(f'(n)). We will show by induction on n that f'(n) = f(n) for all n. The
base case is f'(0) = x_0 = f(0). For the induction step, if f'(n) = f(n), then
f'(n + 1) = g(f'(n)) = g(f(n)) = f(n + 1).
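
One way to see the function described by the lemma concretely is to build it from the starting value and g. The sketch below is an illustration under that reading; define_recursively and doubling are hypothetical names, and the loop simply applies g to the starting value n times.

```python
def define_recursively(x0, g):
    """Return the function f with f(0) = x0 and f(n + 1) = g(f(n))."""
    def f(n):
        value = x0
        for _ in range(n):      # apply g exactly n times
            value = g(value)
        return value
    return f


# Example: x0 = 1 and g(v) = 2 * v gives the powers of two.
doubling = define_recursively(1, lambda v: 2 * v)
print([doubling(n) for n in range(6)])
```

Because the loop is fully determined by the starting value and g, any two functions satisfying the same two equations must agree on every n, which is the uniqueness claim of the lemma.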


5.4 Other ways to think about induction 


In set-theoretic terms, the principle of induction says that if S is a subset of 
N, and both 


1. 0 ∈ S and
2. x ∈ S implies x + 1 ∈ S,


then S = N.

This is logically equivalent to the fact that the naturals are well-ordered. 
This means that any non-empty subset S of N has a smallest element. More
formally: for any S ⊆ N, if S ≠ ∅, then there exists x ∈ S such that for all
y ∈ S, x ≤ y.

It’s easy to see that well-ordering implies induction. Let S be a subset of
N, and consider its complement N \ S. Then either N \ S is empty, meaning
S = N, or N \ S has a least element y. But in the second case either y = 0
and 0 ∉ S or y = x + 1 for some x and x ∈ S but x + 1 ∉ S. So S ≠ N
implies 0 ∉ S or there exists x such that x ∈ S but x + 1 ∉ S. Taking the
contrapositive of this statement gives induction.
The converse is a little trickier, since we need to figure out how to use
induction to prove things about subsets of N, but induction only talks about
elements of N. The trick is to consider only the part of S that is no larger than
some variable n, and show that any S that contains an element less than or
equal to n has a smallest element.


Lemma 5.4.1. For all n ∈ N, if S is a subset of N that contains an element
less than or equal to n, then S has a smallest element. 


Proof. By induction on n. 

The base case is n = 0. Here 0 ∈ S and 0 ≤ x for any x ∈ N, so in
particular 0 ≤ x for any x ∈ S, making 0 the smallest element in S.

For the induction step, suppose that the claim in the lemma holds for n.
To show that it holds for n + 1, suppose that S contains an element less than
or equal to n + 1. Then either (a) S
contains an element less than or equal to n, so S has a smallest element by
the induction hypothesis, or (b) S does not contain an element less than or
equal to n. But in this second case, S must contain n + 1, and since there
are no elements less than n + 1 in S, n + 1 is the smallest element.


To show the full result, let n be some element of S. Then S contains an 
element less than or equal to n, and so S' contains a smallest element. 


5.5 Strong induction 


Sometimes when proving that the induction hypothesis holds for $n + 1$, it
helps to use the fact that it holds for all $n' < n + 1$, not just for $n$. This sort
of argument is called strong induction. Formally, it’s equivalent to simple
induction: the only difference is that instead of proving $\forall k : P(k) \to P(k+1)$,
we prove $\forall k : (\forall m \le k : Q(m)) \to Q(k+1)$. But this is exactly the same
thing if we let $P(k) \equiv \forall m \le k : Q(m)$, since if $\forall m \le k : Q(m)$ implies
$Q(k+1)$, it also implies $\forall m \le k+1 : Q(m)$, giving us the original induction
formula $\forall k : P(k) \to P(k+1)$.

As with simple induction, it can be helpful to think of this approach
backwards, by taking the contraposition. This gives the method of infinite
descent, due to Fermat. The idea is to give a method for taking some $n_0$
for which $P(n_0)$ doesn’t hold, and use it to show that there is some $n_1 < n_0$
for which $P(n_1)$ also doesn’t hold. Repeating this process forever gives an
infinite descending sequence $n_0 > n_1 > n_2 > \ldots$, which would give a subset
of $\mathbb{N}$ with no smallest element. As with any recursive definition, the “repeat”
step is secretly using an induction argument.

An alternative formulation of the method of infinite descent is that since
the naturals are well-ordered, if there is some $n$ for which $P(n)$ doesn’t hold,
there is a smallest $n$ for which it doesn’t hold. But if we can take this $n$ and
find a smaller $n'$ for which $P(n')$ also doesn’t hold, then we get a contradiction.

Historical note: Fermat may have used this technique to construct a
plausible but invalid proof of his famous “Last Theorem” that $a^n + b^n = c^n$
has no non-trivial integer solutions for $n > 2$.


5.5.1 Examples 


• Every $n > 1$ can be factored into a product of one or more prime
numbers.² Proof: By induction on $n$. The base case is $n = 2$, which
factors as $2 = 2$ (one prime factor). For $n > 2$, either (a) $n$ is prime
itself, in which case $n = n$ is a prime factorization; or (b) $n$ is not
prime, in which case $n = ab$ for some $a$ and $b$, both greater than 1.
Since $a$ and $b$ are both less than $n$, by the induction hypothesis we
have $a = p_1 p_2 \cdots p_k$ for some sequence of one or more primes and
similarly $b = p'_1 p'_2 \cdots p'_{k'}$. Then $n = p_1 p_2 \cdots p_k p'_1 p'_2 \cdots p'_{k'}$ is a prime
factorization of $n$.


• Every deterministic bounded two-player perfect-information game that
can’t end in a draw has a winning strategy for one of the players. A 
perfect-information game is one in which both players know the entire 
state of the game at each decision point (like Chess or Go, but unlike 
Poker or Bridge); it is deterministic if there is no randomness that affects 
the outcome (this excludes Backgammon and Monopoly, some variants 
of Poker, and multiple hands of Bridge), and it’s bounded if the game 
is guaranteed to end in at most a fixed number of moves starting from 
any reachable position (this also excludes Backgammon and Monopoly). 
Proof: For each position $x$, let $b(x)$ be the bound on the number of
moves made starting from $x$. Then if $y$ is some position reached from $x$
in one move, we have $b(y) < b(x)$ (because we just used up a move). Let
$f(x) = 1$ if the first player wins starting from position $x$ and $f(x) = 0$
otherwise. We claim that $f$ is well-defined. Proof: By strong induction on
$b(x)$. If $b(x) = 0$, the
game is over, and so $f(x)$ is either 0 or 1, depending on who just won. If
$b(x) > 0$, then $f(x) = \max \{ f(y) \mid y \text{ is a successor to } x \}$ if it’s the first
player’s turn to move and $f(x) = \min \{ f(y) \mid y \text{ is a successor to } x \}$ if
it’s the second player’s turn to move. In either case each $f(y)$ is
well-defined (by the induction hypothesis) and so $f(x)$ is also well-defined.


² A number is prime if it can’t be written as $a \cdot b$ where $a$ and $b$ are both greater than 1.
• The division algorithm: For each $n, m \in \mathbb{N}$ with $m \ne 0$, there is a
unique pair $q, r \in \mathbb{N}$ such that $n = qm + r$ and $0 \le r < m$. Proof: Fix
$m$ then proceed by induction on $n$. If $n < m$, then if $q > 0$ we have
$n = qm + r \ge 1 \cdot m \ge m$, a contradiction. So in this case $q = 0$ is the only
solution, and since $n = qm + r = r$ we have a unique choice of $r = n$.
If $n \ge m$, by the induction hypothesis there is a unique $q'$ and $r'$ such
that $n - m = q'm + r'$ where $0 \le r' < m$. But then $q = q' + 1$ and $r = r'$
satisfies $qm + r = (q' + 1)m + r' = (q'm + r') + m = (n - m) + m = n$.
To show that this solution is unique, if there is some other $q''$ and $r''$
with $0 \le r'' < m$ such that $q''m + r'' = n$, then $q'' \ge 1$ (because $r'' < m \le n$), so
$(q'' - 1)m + r'' = n - m = q'm + r'$, and
by the uniqueness of $q'$ and $r'$ (induction hypothesis again), we have
$q'' - 1 = q' = q - 1$ and $r'' = r' = r$, giving that $q'' = q$ and $r'' = r$. So
$q$ and $r$ are unique.
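The factorization and division proofs above are both recipes for recursive programs: the recursive calls sit exactly where the induction hypothesis is used. The following Python sketch is an illustration only; the function names `factor` and `divide` are assumptions, not names from the notes.

```python
def factor(n):
    """Return a list of primes whose product is n, for an integer n > 1."""
    a = 2
    while a * a <= n:
        if n % a == 0:
            # n = a * (n // a) with both factors smaller than n:
            # recurse on each, as in case (b) of the proof.
            return factor(a) + factor(n // a)
        a += 1
    return [n]  # no divisor found, so n itself is prime (case (a))


def divide(n, m):
    """Return (q, r) with n == q * m + r and 0 <= r < m, for naturals n, m, m != 0."""
    if n < m:
        return (0, n)                      # base case: q = 0 and r = n
    q_prime, r_prime = divide(n - m, m)    # induction hypothesis applied to n - m
    return (q_prime + 1, r_prime)


print(factor(360), divide(17, 5))
```

The recursion in `divide` is as deep as the quotient, so this mirrors the proof rather than being an efficient way to divide.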


5.6 Recursively-defined structures 


A definition of a class of structures can often look like an inductive proof, where
we give a base case and a rule for building bigger structures from smaller 
ones. Structures defined in this way are recursively-defined. 

Examples of recursively-defined structures: 


Finite von Neumann ordinals A finite von Neumann ordinal is either
(a) the empty set $\emptyset$, or (b) $x \cup \{x\}$, where $x$ is a finite von Neumann
ordinal.


Complete binary trees A complete binary tree consists of either (a) a 
leaf node, or (b) an internal node (the root) with two complete binary 
trees as children (or subtrees). 


Boolean formulas A boolean formula consists of either (a) a variable, (b) 
the negation operator applied to a Boolean formula, (c) the AND of 
two Boolean formulas, or (d) the OR of two Boolean formulas. A 
monotone Boolean formula is defined similarly, except that negations 
are forbidden. 


Finite sequences, recursive version Before we defined a finite sequence
as a function from some natural number (in its set form: $n = \{0, 1, 2, \ldots, n - 1\}$)
to some set $S$. We could also define a finite sequence over $S$ recursively,
by the rule: $\langle\rangle$ (the empty sequence) is a finite sequence, and if $a$ is
a finite sequence and $x \in S$, then $(x, a)$ is a finite sequence. (Fans of
LISP will recognize this method immediately.)

The key point is that in each case the definition of an object is
recursive: the object itself may appear as part of a larger object. Usually we
assume that this recursion eventually bottoms out: there are some base cases
(e.g. leaves of complete binary trees or variables in Boolean formulas) that
do not lead to further recursion. If a definition doesn’t bottom out in this
way, the class of structures it describes might not be well-defined (i.e., we
can’t tell if some structure is an element of the class or not).
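Two of these definitions can be written down almost verbatim in code. The sketch below uses Python; the names `ordinal`, `EMPTY`, `cons`, and `to_list` are assumptions made for illustration. It builds finite von Neumann ordinals as nested frozensets and finite sequences as nested pairs.

```python
def ordinal(n):
    """Finite von Neumann ordinal: the empty set, or x ∪ {x} for a smaller ordinal x."""
    if n == 0:
        return frozenset()          # base case (a)
    x = ordinal(n - 1)
    return x | {x}                  # case (b): x ∪ {x}


EMPTY = ()                          # the empty sequence


def cons(x, a):
    """Build the sequence (x, a) from an element x and a sequence a."""
    return (x, a)


def to_list(a):
    """Flatten a nested-pair sequence into an ordinary Python list."""
    items = []
    while a != EMPTY:
        head, a = a                 # peel off one layer of the recursion
        items.append(head)
    return items


three = ordinal(3)
seq = cons(1, cons(2, cons(3, EMPTY)))
print(len(three), to_list(seq))
```

Both constructions bottom out: `ordinal` at 0 and `to_list` at the empty sequence, which is what makes them well-defined.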


5.6.1 Functions on recursive structures 


We can also define functions on recursive structures recursively: 


The depth of a binary tree For a leaf, 0. For a tree consisting of a root
with two subtrees, $1 + \max(d_1, d_2)$, where $d_1$ and $d_2$ are the depths of
the two subtrees.


The value of a Boolean formula given a particular variable assignment 
For a variable, the value (true or false) assigned to that variable. For a 
negation, the negation of the value of its argument. For an AND or 
OR, the AND or OR of the values of its arguments. (This definition is 
not quite as trivial as it looks, but it’s still pretty trivial.) 


Or we can define ordinary functions recursively: 


The Fibonacci series Let $F(0) = F(1) = 1$. For $n > 1$, let $F(n) =
F(n - 1) + F(n - 2)$.


Factorial Let $0! = 1$. For $n > 0$, let $n! = n \cdot ((n - 1)!)$.
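Each of these recursive definitions can be read as a recursive program with one branch per case. In the Python sketch below the representation is an assumption made for illustration: a leaf is `None`, an internal node is a pair of subtrees, and a formula is a tuple whose first entry names the operator.

```python
def depth(tree):
    """Depth of a binary tree: 0 for a leaf, 1 + max of the subtree depths otherwise."""
    if tree is None:
        return 0
    left, right = tree
    return 1 + max(depth(left), depth(right))


def evaluate(formula, assignment):
    """Value of a Boolean formula under a dict mapping variable names to bools."""
    op = formula[0]
    if op == "var":
        return assignment[formula[1]]
    if op == "not":
        return not evaluate(formula[1], assignment)
    if op == "and":
        return evaluate(formula[1], assignment) and evaluate(formula[2], assignment)
    if op == "or":
        return evaluate(formula[1], assignment) or evaluate(formula[2], assignment)
    raise ValueError("unknown operator: " + str(op))


def fib(n):
    """Fibonacci numbers with F(0) = F(1) = 1, as defined in the text."""
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)


def factorial(n):
    """n! with 0! = 1."""
    return 1 if n == 0 else n * factorial(n - 1)


tree = ((None, None), None)
formula = ("and", ("var", "p"), ("not", ("var", "q")))
print(depth(tree), evaluate(formula, {"p": True, "q": False}), fib(7), factorial(5))
```

The naive `fib` recomputes the same values many times; it is meant to mirror the definition, not to be fast.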


5.6.2 Recursive definitions and induction 


Recursive definitions have the same form as an induction proof. There are 
one or more base cases, and one or more recursion steps that correspond to 
the induction step in an induction proof. The connection is not surprising if 
you think of a definition of some class of objects as a predicate that identifies 
members of the class: a recursive definition is just a formula for writing 
induction proofs that say that certain objects are members. 

Recursively-defined objects and functions also lend themselves easily to 
induction proofs about their properties; on general structures, such induction 
arguments go by the name of structural induction. 
5.6.3 Structural induction


For finite structures, we can do induction over the structure. Formally we 
can think of this as doing induction on the size of the structure or part of 
the structure we are looking at. 

Examples: 


Every complete binary tree with $n$ leaves has $n - 1$ internal nodes
Base case is a tree consisting of just a leaf; here $n = 1$ and there are
$n - 1 = 0$ internal nodes. The induction step considers a tree consisting
of a root and two subtrees. Let $n_1$ and $n_2$ be the number of leaves in
the two subtrees; we have $n_1 + n_2 = n$; and the number of internal
nodes, counting the nodes in the two subtrees plus one more for the
root, is $(n_1 - 1) + (n_2 - 1) + 1 = n_1 + n_2 - 1 = n - 1$.


Monotone Boolean formulas generate monotone functions What this 
means is that changing a variable from false to true can never change 
the value of the formula from true to false. Proof is by induction on 
the structure of the formula: for a naked variable, it’s immediate. For 
an AND or OR, observe that changing a variable from false to true 
can only leave the values of the arguments unchanged, or change one 
or both from false to true (induction hypothesis); the rest follows by 
staring carefully at the truth table for AND or OR. 


Bounding the size of a binary tree with depth $d$ We’ll show that it
has at most $2^{d+1} - 1$ nodes. Base case: the tree consists of one leaf,
$d = 0$, and there are $2^{0+1} - 1 = 2 - 1 = 1$ nodes. Induction step:
Given a tree of depth $d \ge 1$, it consists of a root (1 node), plus two
subtrees of depth at most $d - 1$. The two subtrees each have at most
$2^{d-1+1} - 1 = 2^d - 1$ nodes (induction hypothesis), so the total number
of nodes is at most $2(2^d - 1) + 1 = 2^{d+1} - 2 + 1 = 2^{d+1} - 1$.
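The two counting claims about binary trees can be spot-checked on randomly generated trees. This Python sketch is evidence, not a proof. It assumes a representation in which a leaf is `None` and an internal node is a pair of subtrees; the helper names are hypothetical.

```python
import random


def random_tree(max_depth, rng):
    """Build a random complete binary tree of depth at most max_depth."""
    if max_depth == 0 or rng.random() < 0.3:
        return None
    return (random_tree(max_depth - 1, rng), random_tree(max_depth - 1, rng))


def leaves(t):
    return 1 if t is None else leaves(t[0]) + leaves(t[1])


def internal(t):
    return 0 if t is None else 1 + internal(t[0]) + internal(t[1])


def depth(t):
    return 0 if t is None else 1 + max(depth(t[0]), depth(t[1]))


def claims_hold(t):
    """Check both structural-induction claims for one tree."""
    nodes = leaves(t) + internal(t)
    return internal(t) == leaves(t) - 1 and nodes <= 2 ** (depth(t) + 1) - 1


rng = random.Random(0)              # fixed seed so the run is repeatable
all_ok = all(claims_hold(random_tree(8, rng)) for _ in range(1000))
print(all_ok)
```

Each helper recurses on the two subtrees, which is the same shape as the induction step in the proofs.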


Chapter 6 


Summation notation 


6.1 Summations 


Given a sequence $x_a, x_{a+1}, \ldots, x_b$, its sum $x_a + x_{a+1} + \cdots + x_b$ is written as
the summation $\sum_{i=a}^{b} x_i$.

The large jagged symbol is a stretched-out version of a capital Greek
letter sigma. The variable $i$ is called the index of summation, $a$ is the
lower bound or lower limit, and $b$ is the upper bound or upper limit.
Mathematicians invented this notation centuries ago because they didn’t
have for loops; the intent is that you loop through all values of $i$ from $a$ to
$b$ (including both endpoints), summing up the body of the summation for
each $i$.

If $b < a$, then the sum is zero. For example,

\[ \sum_{i=0}^{-5} 2^i \sin i = 0. \]

This rule mostly shows up as an extreme case of a more general formula,
e.g.

\[ \sum_{i=1}^{n} i = \frac{n(n+1)}{2}, \]

which still works even when $n = 0$ or $n = -1$ (but not for $n \le -2$).

Summation notation is used both for laziness (it’s more compact to write
$\sum_{i=0}^{n} (2i + 1)$ than $1 + 3 + 5 + 7 + \cdots + (2n + 1)$) and precision (it’s also more
clear exactly what you mean).
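Reading a summation as a for loop can be made literal. In the Python sketch below the helper name `summation` is an assumption. The loop runs from the lower bound to the upper bound inclusive, so a sum with upper bound below the lower bound has no iterations and returns zero.

```python
def summation(a, b, body):
    """Sum body(i) for i = a, a+1, ..., b; the loop is empty when b < a."""
    total = 0
    for i in range(a, b + 1):
        total += body(i)
    return total


def closed_form(n):
    """The formula n(n+1)/2 for the sum of i from 1 to n."""
    return n * (n + 1) // 2


# The formula agrees with the loop for n >= -1 but not for n = -2.
agreement = {n: summation(1, n, lambda i: i) == closed_form(n) for n in range(-2, 6)}
print(agreement)
```

For $n = -2$ the loop gives 0 while the formula gives 1, matching the remark that the formula fails for $n \le -2$.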
6.1.1 Formal definition


Formally, we define a summation by the recurrence

\[ \sum_{i=a}^{b} f(i) = \begin{cases} 0 & \text{if } b < a, \\ f(a) + \sum_{i=a+1}^{b} f(i) & \text{otherwise.} \end{cases} \tag{6.1.1} \]


In English, we can compute a summation recursively by adding the first 
value to the sum of the remaining values. 
A typical application of this definition might look like this: 


\begin{align*}
\sum_{i=1}^{3} i &= 1 + \sum_{i=2}^{3} i \\
&= 1 + 2 + \sum_{i=3}^{3} i \\
&= 1 + 2 + 3 + \sum_{i=4}^{3} i \\
&= 1 + 2 + 3 + 0 \\
&= 6.
\end{align*}

In principle, we can also use the definition even if the bounds are not
integers:

\begin{align*}
\sum_{i=1/2}^{9/4} i &= 1/2 + \sum_{i=3/2}^{9/4} i \\
&= 1/2 + 3/2 + \sum_{i=5/2}^{9/4} i \\
&= 1/2 + 3/2 + 0 \\
&= 2,
\end{align*}

but this is uncommon and confusing. The times when it might come up are
when our lower bound is an integer but the upper bound might not be, as in

\[ \sum_{i=1}^{n/2} i. \]

In cases like this, many writers will put in an explicit floor or ceiling
(see §3.5.1) to make it clear where the summation is supposed to stop:

\[ \sum_{i=1}^{\lfloor n/2 \rfloor} i. \]
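The recurrence (6.1.1) is itself a recursive program. The Python sketch below (the name `rec_sum` is an assumption) follows it literally and uses exact fractions to reproduce the computation with non-integer bounds.

```python
from fractions import Fraction


def rec_sum(a, b, f):
    """Sum of f(i) for i = a, a+1, ..., following recurrence (6.1.1)."""
    if b < a:
        return 0                          # empty sum
    return f(a) + rec_sum(a + 1, b, f)    # first value plus the rest


def identity(i):
    return i


integer_case = rec_sum(1, 3, identity)
fractional_case = rec_sum(Fraction(1, 2), Fraction(9, 4), identity)

n = 7
floored_case = rec_sum(1, n // 2, identity)   # explicit floor on the upper bound
print(integer_case, fractional_case, floored_case)
```

With the fractional bounds the index takes the values 1/2 and 3/2 and then stops, because 5/2 exceeds 9/4.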




In the case where b—a is an integer, we can also compute sums by pulling 
elements off the top. This is sometimes more convenient, and is justified by 
the following lemma: 


Lemma 6.1.1. If $b - a$ is an integer, then

\[ \sum_{i=a}^{b} f(i) = \begin{cases} 0 & \text{if } b < a, \\ f(b) + \sum_{i=a}^{b-1} f(i) & \text{otherwise.} \end{cases} \]


Proof. For $b < a$, (6.1.1) correctly returns 0. We will prove the remaining
cases $b \ge a$ by induction on $b - a$.

If $b - a = 0$, then applying (6.1.1) gives $\sum_{i=a}^{b} f(i) = f(a) = f(b) =
f(b) + \sum_{i=a}^{b-1} f(i)$. This is our base case.

If $b - a > 0$, then we can compute

\begin{align*}
\sum_{i=a}^{b} f(i) &= f(a) + \sum_{i=a+1}^{b} f(i) \\
&= f(a) + f(b) + \sum_{i=a+1}^{b-1} f(i) \\
&= f(b) + \sum_{i=a}^{b-1} f(i),
\end{align*}

where the first and last steps use the definition (6.1.1) and the middle step
uses the induction hypothesis, which holds because the gap between the
bounds $a + 1$ and $b$ is $b - (a + 1) = b - a - 1 < b - a$.


Although Lemma 6.1.1 holds whenever the difference between the bounds 
is an integer, in practice we will mostly use it when both bounds are integers. 


6.1.2 Scope 


The scope of a summation extends to the first addition or subtraction symbol 
that is not enclosed in parentheses or part of some larger term (e.g., in the 
numerator of a fraction). So 


\[ \sum_{i=1}^{n} i^2 + 1 = \left( \sum_{i=1}^{n} i^2 \right) + 1 = 1 + \sum_{i=1}^{n} i^2 \ne \sum_{i=1}^{n} (i^2 + 1). \]


Since this can be confusing, it is generally safest to wrap the sum in
parentheses (as in the second form) or move any trailing terms to the
beginning. An exception is when adding together two sums, as in

\[ \sum_{i=1}^{n} i^2 + \sum_{i=1}^{n^2} i = \left( \sum_{i=1}^{n} i^2 \right) + \left( \sum_{i=1}^{n^2} i \right). \]


Here the looming bulk of the second sigma warns the reader that the 
first sum is ending; it is much harder to miss than the relatively tiny plus 
symbol in the first example. 


6.1.3. Summation identities 


The summation operator is linear. This means that constant factors can be 
pulled out of sums: 


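\[ \sum_{i=n}^{m} a x_i = a \sum_{i=n}^{m} x_i \tag{6.1.2} \]

and sums inside sums can be split:

\[ \sum_{i=n}^{m} (x_i + y_i) = \sum_{i=n}^{m} x_i + \sum_{i=n}^{m} y_i. \tag{6.1.3} \]


With multiple sums, the order of summation is not important, provided
the bounds on the inner sum don’t depend on the index of the outer sum:

\[ \sum_{i=n}^{m} \sum_{j=n'}^{m'} x_{ij} = \sum_{j=n'}^{m'} \sum_{i=n}^{m} x_{ij}. \]


Products of sums can be turned into double sums of products and vice
versa:

\[ \left( \sum_{i=n}^{m} x_i \right) \left( \sum_{j=n'}^{m'} y_j \right) = \sum_{i=n}^{m} \sum_{j=n'}^{m'} x_i y_j. \]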


These identities can often be used to transform a sum you can’t solve 
into something simpler. 
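Before proving the identities, it can be reassuring to check them numerically on a small example. The Python sketch below uses arbitrary sample data and assumed helper names; it tests linearity, exchanging the order of summation, and turning a product of sums into a double sum.

```python
def S(lo, hi, term):
    """Sum term(i) for i = lo, ..., hi (zero when hi < lo)."""
    return sum(term(i) for i in range(lo, hi + 1))


# Arbitrary sample data, chosen only for illustration.
x = [3, 1, 4, 1, 5, 9, 2, 6]
y = [2, 7, 1, 8, 2, 8, 1, 8]
a, n, m = 5, 2, 6          # constant factor and outer bounds n..m
n2, m2 = 0, 3              # inner bounds n'..m'


def check_identities():
    """Return one boolean per identity, in the order they appear in the text."""
    linear = S(n, m, lambda i: a * x[i]) == a * S(n, m, lambda i: x[i])
    split = (S(n, m, lambda i: x[i] + y[i])
             == S(n, m, lambda i: x[i]) + S(n, m, lambda i: y[i]))
    swap = (S(n, m, lambda i: S(n2, m2, lambda j: 10 * x[i] + y[j]))
            == S(n2, m2, lambda j: S(n, m, lambda i: 10 * x[i] + y[j])))
    product = (S(n, m, lambda i: x[i]) * S(n2, m2, lambda j: y[j])
               == S(n, m, lambda i: S(n2, m2, lambda j: x[i] * y[j])))
    return linear, split, swap, product


print(check_identities())
```

A passing check on one data set proves nothing in general, but a failing one would immediately expose a misremembered identity.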

To prove these identities, use induction and (6.1.1). For example, the
following lemma demonstrates a generalization of (6.1.2) and (6.1.3):


Lemma 6.1.2.

\[ \sum_{i=n}^{m} (a x_i + b y_i) = a \sum_{i=n}^{m} x_i + b \sum_{i=n}^{m} y_i. \]


Proof. If $m < n$, then both sides of the equation are zero. This proves
that the lemma holds for small $m$ and gives us a base case for our induction at
$m = n - 1$.

For the induction step, we want to show that the lemma holds for $m + 1$ if it
holds for $m$. This is a straightforward computation using Lemma 6.1.1, first
to unpack the combined sum and then to repack the split sums:

\begin{align*}
\sum_{i=n}^{m+1} (a x_i + b y_i) &= \sum_{i=n}^{m} (a x_i + b y_i) + (a x_{m+1} + b y_{m+1}) \\
&= a \sum_{i=n}^{m} x_i + b \sum_{i=n}^{m} y_i + a x_{m+1} + b y_{m+1} \\
&= a \left( \sum_{i=n}^{m} x_i + x_{m+1} \right) + b \left( \sum_{i=n}^{m} y_i + y_{m+1} \right) \\
&= a \sum_{i=n}^{m+1} x_i + b \sum_{i=n}^{m+1} y_i.
\end{align*}

6.1.4 Choosing and replacing index variables 


When writing a summation, you can generally pick any index variable you
like, although $i$, $j$, $k$, etc., are popular choices. Usually it’s a good idea to
pick an index that isn’t used outside the sum. Though

\[ \sum_{n=0}^{n} n = \sum_{i=0}^{n} i \]

has a well-defined meaning, the version on the right-hand side is a lot
less confusing.

In addition to renaming indices, you can also shift them, provided you
shift the bounds to match. For example, rewriting

\[ \sum_{i=1}^{n} (i - 1) \]

by substituting $j$ for $i - 1$ gives

\[ \sum_{j=0}^{n-1} j, \]

which is easier to work with.
6.1.5 Sums over given index sets


Sometimes we’d like to sum an expression over values that aren’t consecutive 
integers, or may not even be integers at all. This can be done using a sum 
over all indices that are members of a given index set, or in the most general 
form satisfy some given predicate (with the usual set-theoretic caveat that 
the objects that satisfy the predicate must form a set). Such a sum is written 
by replacing the lower and upper limits with a single subscript that gives 
the predicate that the indices must obey. 
For example, we could sum $i^2$ for $i$ in the set $\{3, 5, 7\}$:

\[ \sum_{i \in \{3,5,7\}} i^2 = 3^2 + 5^2 + 7^2 = 83. \]


Or we could sum the sizes of all subsets of a given set $S$:

\[ \sum_{A \subseteq S} |A|. \]


Or we could sum the inverses of all prime numbers less than 1000:

\[ \sum_{p < 1000,\ p \text{ is prime}} 1/p. \]


Sometimes when writing a sum in this form it can be confusing exactly 
which variables are the indices. The usual convention is that a variable is 
always an index if it doesn’t have any meaning outside the sum, and the 
index variable is put first in the expression under the sigma if possible. If it 
is not obvious what a complicated sum means, it is generally best to try to 
rewrite it to make it more clear. Still, you may see sums that look like 


\[ \sum_{1 \le i < j \le n} \frac{i}{j} \]

or

\[ \sum_{x \in A \subseteq S} |A|, \]

where the first sum sums over all pairs of values $(i, j)$ such that $1 \le i$,
$i < j$, and $j \le n$, with each pair appearing exactly once; and the second
sums over all sets $A$ that are subsets of $S$ and contain $x$ (assuming $x$ and
S are defined outside the summation). Hopefully, you will not run into too 
many sums that look like this, but it’s worth being able to decode them if 
you do. 
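Sums over index sets correspond to loops over a collection, optionally filtered by a predicate. The Python sketch below works through the examples from this section. The helper names, the three-element set `S`, and the summand `Fraction(i, j)` over pairs are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def subsets(s):
    """Generate every subset of the finite set s."""
    items = sorted(s)
    for k in range(len(items) + 1):
        for chosen in combinations(items, k):
            yield set(chosen)


def is_prime(p):
    """Trial division; adequate for the small range used here."""
    return p > 1 and all(p % d != 0 for d in range(2, int(p ** 0.5) + 1))


squares = sum(i ** 2 for i in {3, 5, 7})

S = {"a", "b", "c"}
subset_sizes = sum(len(A) for A in subsets(S))

prime_inverses = sum(1 / p for p in range(1000) if is_prime(p))

# All pairs (i, j) with 1 <= i < j <= n, each appearing exactly once.
n = 3
pairs = sum(Fraction(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1))

# All subsets A of S that contain the fixed element "a".
containing_a = sum(len(A) for A in subsets(S) if "a" in A)

print(squares, subset_sizes, pairs, containing_a)
```

In each case the condition written under the sigma becomes the loop range plus an `if` filter.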
Sums over a given set are guaranteed to be well-defined only if the set is
finite. In this case we can use the fact that there is a bijection between any
finite set $S$ and the ordinal $|S|$ to rewrite the sum as a sum over indices in $|S|$.
For example, if $|S| = n$, then there exists a bijection $f : \{0 \ldots n-1\} \leftrightarrow S$,
so we can define

\[ \sum_{i \in S} x_i = \sum_{i=0}^{n-1} x_{f(i)}. \tag{6.1.4} \]


This allows us to apply (6.1.1) to decompose the sum further:

\[ \sum_{i \in S} x_i = \begin{cases} 0 & \text{if } S = \emptyset, \\ \left( \sum_{i \in S \setminus \{z\}} x_i \right) + x_z & \text{if } z \in S. \end{cases} \tag{6.1.5} \]

The idea is that for any particular $z \in S$, we can always choose a bijection
that makes $z = f(|S| - 1)$.


If $S$ is infinite, computing the sum is trickier. For countable $S$, where
there is a bijection $f : \mathbb{N} \leftrightarrow S$, we can sometimes rewrite

\[ \sum_{i \in S} x_i = \sum_{i=0}^{\infty} x_{f(i)} \]

and use the definition of an infinite sum (given below). Note that if the
$x_i$ have different signs the result we get may depend on which bijection we
choose. For this reason such infinite sums are probably best avoided unless
you can explicitly use $\mathbb{N}$ or a subset of $\mathbb{N}$ as the index set.


6.1.6 Sums without explicit bounds 


When the index set is understood from context, it is often dropped, leaving
only the index, as in $\sum_i i^2$. This will generally happen only if the index spans
all possible values in some obvious range, and can be a mark of sloppiness
in formal mathematical writing. Theoretical physicists adopt a still more
lazy approach, and leave out the $\sum_i$ part entirely in certain special types
of sums: this is known as the Einstein summation convention after the
notoriously lazy physicist who proposed it.


6.1.7 Infinite sums 


Sometimes you may see an expression where the upper limit is infinite, as in 


\[ \sum_{i=1}^{\infty} \frac{1}{i^2}. \]

The meaning of this expression is the limit of the series $s$ obtained by
taking the sum of the first term, the sum of the first two terms, the sum
of the first three terms, etc. The series converges to a particular value $x$ if
for any $\epsilon > 0$, there exists an $N$ such that for all $n > N$, the value of $s_n$ is
within $\epsilon$ of $x$ (formally, $|s_n - x| < \epsilon$). We will see some examples of infinite
sums when we look at generating functions in §11.3.


6.1.8 Double sums 


Nothing says that the expression inside a summation can’t be another 
summation. This gives double sums, such as in this rather painful definition 
of multiplication for non-negative integers: 


\[ a \times b \stackrel{\text{def}}{=} \sum_{i=1}^{a} \sum_{j=1}^{b} 1. \]



If you think of a sum as a for loop, a double sum is two nested for loops. 
The effect is to sum the innermost expression over all pairs of values of the 
two indices. 

Here’s a more complicated double sum where the limits on the inner sum 
depend on the index of the outer sum: 


\[ \sum_{i=0}^{n} \sum_{j=0}^{i} (i + 1)(j + 1). \]

When $n = 1$, this will compute $(0+1)(0+1) + (1+1)(0+1) + (1+1)(1+1) =
7$. For larger $n$ the number of terms grows quickly.
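Both double sums above can be written as nested loops. This is a Python sketch with assumed function names. In the second function the inner loop's range depends on the outer index.

```python
def multiply(a, b):
    """Multiply non-negative integers by summing 1 over all pairs (i, j)."""
    total = 0
    for i in range(1, a + 1):
        for j in range(1, b + 1):
            total += 1
    return total


def triangular_double_sum(n):
    """Sum of (i + 1)(j + 1) for 0 <= i <= n and 0 <= j <= i."""
    total = 0
    for i in range(n + 1):
        for j in range(i + 1):      # inner bound depends on the outer index
            total += (i + 1) * (j + 1)
    return total


print(multiply(6, 7), triangular_double_sum(1))
```

Because the inner bound depends on $i$, the two loops in `triangular_double_sum` cannot simply be swapped without also changing the bounds.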


6.2 Products 


What if you want to multiply a series of values instead of add them? The 
notation is the same as for a sum, except that you replace the sigma with a 
pi, as in this definition of the factorial function for non-negative n: 


\[ n! \stackrel{\text{def}}{=} \prod_{i=1}^{n} i = 1 \cdot 2 \cdot \cdots \cdot n. \]



The other difference is that while an empty sum is defined to have the 
value 0, an empty product is defined to have the value 1. The reason for 
this rule (in both cases) is that an empty sum or product should return the 
identity element for the corresponding operation: the value that when
added to or multiplied by some other value $x$ doesn’t change $x$. This allows
writing general rules like:

\begin{align*}
\sum_{i \in A} f(i) + \sum_{i \in B} f(i) &= \sum_{i \in A \cup B} f(i) \\
\left( \prod_{i \in A} f(i) \right) \cdot \left( \prod_{i \in B} f(i) \right) &= \prod_{i \in A \cup B} f(i)
\end{align*}

which hold as long as $A \cap B = \emptyset$. Without the rule that the sum of an
empty set was 0 and the product 1, we’d have to put in a special case for 
when one or both of A and B were empty. 

Note that a consequence of this definition is that 0! = 1. 


6.3. Other big operators 


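Some more obscure operators also allow you to compute some aggregate over
a series, with the same rules for indices, lower and upper limits, etc., as $\sum$
and $\prod$. These include:


• Big AND:
\[ \bigwedge_{x \in S} P(x) \equiv P(x_1) \land P(x_2) \land \ldots \equiv \forall x \in S : P(x). \]

• Big OR:
\[ \bigvee_{x \in S} P(x) \equiv P(x_1) \lor P(x_2) \lor \ldots \equiv \exists x \in S : P(x). \]

• Big Intersection:
\[ \bigcap_{i=1}^{n} A_i = A_1 \cap A_2 \cap \ldots \cap A_n. \]

• Big Union:
\[ \bigcup_{i=1}^{n} A_i = A_1 \cup A_2 \cup \ldots \cup A_n. \]
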
These all behave pretty much the way one would expect. One issue that 
is not obvious from the definition is what happens with an empty index set. 
Here the rule as with sums and products is to return the identity element 
for the operation. This will be True for AND, False for OR, and the empty 
set for union; for intersection, there is no identity element in general, so the 
intersection over an empty collection of sets is undefined. 
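The identity-element convention for empty index sets is the same one Python's built-in aggregates follow, which makes the splitting rules easy to test. In the sketch below the sets `A` and `B` and the function `f` are arbitrary choices made for illustration.

```python
import math


def factorial(n):
    """n! as a product over i = 1..n; the empty product gives 0! = 1."""
    return math.prod(range(1, n + 1))


def f(i):
    return 2 * i + 1


A, B = {1, 2, 3}, {4, 5}
cases = [(A, B), (A, set()), (set(), set())]    # disjoint pairs, some empty

# Sum and product over A ∪ B split into the parts over A and over B.
sum_rule = all(
    sum(f(i) for i in X) + sum(f(i) for i in Y) == sum(f(i) for i in X | Y)
    for X, Y in cases
)
product_rule = all(
    math.prod(f(i) for i in X) * math.prod(f(i) for i in Y)
    == math.prod(f(i) for i in X | Y)
    for X, Y in cases
)

# Values returned for an empty index set: sum, product, AND, OR, union.
empty_values = (sum([]), math.prod([]), all([]), any([]), set().union())
print(sum_rule, product_rule, empty_values)
```

No special case is needed for the empty sets in `cases`, which is exactly the benefit the text attributes to the convention.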
6.4 Closed forms


When confronted with some nasty sum, it is nice to be able to convert it
into a simpler expression that doesn’t contain any summation signs or other
operators that iterate over some bound variable. Such an expression is known
as a closed form.

It is not always possible to do this: the techniques available are mostly
limited to massaging the summation until it turns into something whose
simpler expression you remember.¹

To do this, it helps to have both (a) a big toolbox of summations with 
known values, and (b) rules for manipulating summations to get them into a 
more convenient form. We'll start with the toolbox. 


6.4.1 Some standard sums 


Here are the three formulas you should either memorize or remember how to 
derive: 


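\begin{align*}
\sum_{i=1}^{n} 1 &= n \\
\sum_{i=1}^{n} i &= \frac{n(n+1)}{2} \\
\sum_{i=0}^{n} r^i &= \frac{1 - r^{n+1}}{1 - r}
\end{align*}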


Rigorous proofs of these can be obtained by induction on n. The first 
one is pretty easy. 

A not so rigorous proof of the second identity can be given using a trick 
alleged to have been invented by the legendary 18th-century mathematician 
Carl Friedrich Gauss, at a frighteningly early age, by adding up two copies 
of the sequence running in opposite directions, one term at a time: 


\begin{align*}
S &= 1 + 2 + \cdots + n \\
S &= n + (n-1) + \cdots + 1 \\
2S &= (n+1) + (n+1) + \cdots + (n+1) = n(n+1),
\end{align*}

and from $2S = n(n+1)$ we get $S = n(n+1)/2$. One way to remember this
is that the average value in the sequence is $(n+1)/2$, the average of the values at
the ends. We can then multiply by the number of values $n$ to get the total.


¹ If you have done integrals in calculus (see §H.3), this process will be unpleasantly
familiar.
For the last identity, start with

\[ \sum_{i=0}^{\infty} r^i = \frac{1}{1 - r}, \]

which holds when $|r| < 1$. The proof is that if

\[ S = \sum_{i=0}^{\infty} r^i \]

then

\[ rS = \sum_{i=0}^{\infty} r^{i+1} = \sum_{i=1}^{\infty} r^i \]

and so

\[ S - rS = r^0 = 1. \]

Solving for $S$ gives $S = 1/(1 - r)$.
We can now get the sum up to $n$ by subtracting off the extra terms
starting with $r^{n+1}$:

\[ \sum_{i=0}^{n} r^i = \sum_{i=0}^{\infty} r^i - r^{n+1} \sum_{i=0}^{\infty} r^i = \frac{1}{1 - r} - \frac{r^{n+1}}{1 - r} = \frac{1 - r^{n+1}}{1 - r}. \]


Though this particular proof only works for $|r| < 1$, the formula works
for any $r$ not equal to 1.² If $r$ is equal to 1, then the formula doesn’t work (it
requires dividing zero by zero), but there is an easier way to get the solution.

These standard summations can be combined with linearity to solve more 
complicated problems. For example, we can directly compute 


\begin{align*}
\sum_{i=0}^{n} (3 \cdot 2^i + 5) &= 3 \sum_{i=0}^{n} 2^i + 5 \sum_{i=0}^{n} 1 \\
&= 3 \cdot (2^{n+1} - 1) + 5(n + 1) \\
&= 3 \cdot 2^{n+1} + 5n + 2.
\end{align*}

Other useful summations can be found in various places. Rosen [Ros12]
and Graham et al. [GKP94] both provide tables of sums in their chapters on
generating functions. But it is usually better to be able to reconstruct the
solution of a sum rather than trying to memorize such tables.
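The standard sums, and the worked example that combines them, can be checked against brute-force loops. The Python sketch below uses exact rational arithmetic so that the geometric-series formula is tested without rounding error. The helper names and the sample ratios are assumptions.

```python
from fractions import Fraction


def geometric(r, n):
    """Brute-force sum of r**i for i = 0..n."""
    return sum(r ** i for i in range(n + 1))


def standard_sums_hold(n):
    """Compare each closed form with the corresponding loop for one value of n."""
    ones = sum(1 for i in range(1, n + 1)) == n
    gauss = sum(range(1, n + 1)) == n * (n + 1) // 2
    ratios = (Fraction(1, 2), Fraction(-3), Fraction(5, 3))   # any r != 1 works
    geom = all(geometric(r, n) == (1 - r ** (n + 1)) / (1 - r) for r in ratios)
    combined = sum(3 * 2 ** i + 5 for i in range(n + 1)) == 3 * 2 ** (n + 1) + 5 * n + 2
    return ones and gauss and geom and combined


all_hold = all(standard_sums_hold(n) for n in range(20))
print(all_hold)
```

The ratio $-3$ lies outside $|r| < 1$, illustrating that the finite formula does not need the convergence condition.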

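
² Proof: By induction on $n$. For $n = 0$, the formula gives $\frac{1 - r^{0+1}}{1 - r} = \frac{1 - r}{1 - r} = 1 =
r^0 = \sum_{i=0}^{0} r^i$. For larger $n$, compute $\frac{1 - r^{n+1}}{1 - r} = \frac{1 - r^n}{1 - r} + \frac{r^n - r^{n+1}}{1 - r} =
\sum_{i=0}^{n-1} r^i + \frac{r^n (1 - r)}{1 - r} = \sum_{i=0}^{n-1} r^i + r^n = \sum_{i=0}^{n} r^i$.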
6.4.2 Guess but verify


If nothing else works, you can try using the “guess but verify” method, which 
also works more generally for identifying sequences defined recursively. Here 
we write out the values of the summation for the first few values of the upper 
limit (for example), and hope that we recognize the sequence. If we do, we 
can then try to prove that a formula for the sequence of sums is correct by 
induction. 

Example: Suppose we want to compute

\[ S(n) = \sum_{k=1}^{n} (2k - 1) \]

but that it doesn’t occur to us to split it up and use the $\sum_{k=1}^{n} k$ and
$\sum_{k=1}^{n} 1$ formulas. Instead, we can write down a table of values:


n    S(n)

0    0

1    1

2    1 + 3 = 4

3    1 + 3 + 5 = 9

4    1 + 3 + 5 + 7 = 16

5    1 + 3 + 5 + 7 + 9 = 25


At this point we might guess that $S(n) = n^2$. To verify this, observe that
it holds for $n = 0$, and for larger $n$ we have $S(n) = S(n - 1) + (2n - 1) =
(n - 1)^2 + 2n - 1 = n^2 - 2n + 1 + 2n - 1 = n^2$. So we can conclude that our
guess was correct.
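The "guess" half of the method is easy to automate: tabulate the sum for small upper limits and look for a pattern. The Python sketch below (with assumed names `S` and `table`) builds the table and then tests the guess $n^2$ on more values. Testing is not a substitute for the induction argument, but it does catch wrong guesses quickly.

```python
def S(n):
    """Brute-force value of the sum of (2k - 1) for k = 1..n."""
    return sum(2 * k - 1 for k in range(1, n + 1))


table = [(n, S(n)) for n in range(6)]
for n, value in table:
    print(n, value)

# Check the guess S(n) = n^2 on a larger range of inputs.
guess_survives = all(S(n) == n * n for n in range(200))
print(guess_survives)
```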


6.4.3. Ansatzes 


A slightly more sophisticated approach to guess but verify involves guessing 
the form of the solution, but leaving a few parameters unfixed so that we can 
adjust them to match the actual data. This parameterized guess is called 
an ansatz, from the German word for “starting point,” because guesswork 
sounds much less half-baked if you can refer to it in German. 

To make this work, it helps to have some idea of what the solution to a 
sum might look like. One useful rule of thumb is that a sum over a degree-d 
polynomial is usually a degree-(d + 1) polynomial. 

For example, let’s guess that

\[ \sum_{i=0}^{n} i^2 = c_3 n^3 + c_2 n^2 + c_1 n + c_0, \tag{6.4.1} \]

when $n \ge 0$.


Under the assumption that (6.4.1) holds, we can plug in $n = 0$ to get
$\sum_{i=0}^{0} i^2 = 0 = c_0$. This means that we only need to figure out $c_3$, $c_2$, and $c_1$.


Plugging in some small values for $n$ gives

\begin{align*}
0 + 1 = 1 &= c_3 + c_2 + c_1 \\
0 + 1 + 4 = 5 &= 8c_3 + 4c_2 + 2c_1 \\
0 + 1 + 4 + 9 = 14 &= 27c_3 + 9c_2 + 3c_1
\end{align*}

With some effort, this system of equations can be solved to obtain
$c_3 = 1/3$, $c_2 = 1/2$, $c_1 = 1/6$, giving the formula

\[ \sum_{i=0}^{n} i^2 = \frac{1}{3} n^3 + \frac{1}{2} n^2 + \frac{1}{6} n. \tag{6.4.2} \]


This is often written as

\[ \sum_{i=0}^{n} i^2 = \frac{(2n + 1) n (n + 1)}{6}, \tag{6.4.3} \]


which is the same formula, just factored.

We still don’t know that (6.4.2) actually works, since we only looked at 
the first four values in the sequence. To show that it does work, we do an 
induction argument. 

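The base case is $n = 0$, which we know works. For the induction step,
compute

\begin{align*}
\frac{1}{3}(n+1)^3 + \frac{1}{2}(n+1)^2 + \frac{1}{6}(n+1)
&= \left( \frac{1}{3} n^3 + n^2 + n + \frac{1}{3} \right) + \left( \frac{1}{2} n^2 + n + \frac{1}{2} \right) + \left( \frac{1}{6} n + \frac{1}{6} \right) \\
&= \frac{1}{3} n^3 + \frac{1}{2} n^2 + \frac{1}{6} n + n^2 + 2n + 1 \\
&= \sum_{i=0}^{n} i^2 + (n+1)^2 \\
&= \sum_{i=0}^{n+1} i^2.
\end{align*}
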
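The "with some effort" step of the ansatz method is just solving a small linear system, which can be delegated to a program. The Python sketch below is illustrative: the `solve` helper is a hypothetical hand-written Gauss-Jordan elimination over exact fractions, not a library routine. It recovers the coefficients from the values of the sum at $n = 1, 2, 3$ and then tests the resulting formula on further values.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square and nonsingular)."""
    size = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]        # scale pivot row to 1
        for r in range(size):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[size] for row in M]


def sum_of_squares(n):
    return sum(i * i for i in range(n + 1))


# One equation per sample point n: c3*n^3 + c2*n^2 + c1*n = sum of squares up to n.
points = [1, 2, 3]
A = [[n ** 3, n ** 2, n] for n in points]
b = [sum_of_squares(n) for n in points]
c3, c2, c1 = solve(A, b)

formula_holds = all(c3 * n ** 3 + c2 * n ** 2 + c1 * n == sum_of_squares(n) for n in range(50))
print(c3, c2, c1, formula_holds)
```

As in the text, agreement on finitely many values only supports the guess; the induction argument is what proves it.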


Chapter 7 


Asymptotic notation 


Asymptotic notation is a tool for describing the behavior of functions on 
large values, which is used extensively in the analysis of algorithms. 


7.1 Definitions 


$O(f(n))$ A function $g(n)$ is in $O(f(n))$ (“big O of $f(n)$”) if there exist
constants $c > 0$ and $N$ such that $|g(n)| \le c|f(n)|$ for all $n > N$.


$\Omega(f(n))$ A function $g(n)$ is in $\Omega(f(n))$ (“big Omega of $f(n)$”) if there exist
constants $c > 0$ and $N$ such that $|g(n)| \ge c|f(n)|$ for all $n > N$.


$\Theta(f(n))$ A function $g(n)$ is in $\Theta(f(n))$ (“big Theta of $f(n)$”) if there exist
constants $c_1 > 0$, $c_2 > 0$, and $N$ such that $c_1|f(n)| \le |g(n)| \le c_2|f(n)|$
for all $n > N$. This is equivalent to saying that $g(n)$ is in both $O(f(n))$
and $\Omega(f(n))$.


$o(f(n))$ A function $g(n)$ is in $o(f(n))$ (“little o of $f(n)$”) if for every $c > 0$
there exists an $N$ such that $|g(n)| \le c|f(n)|$ for all $n > N$. This is
equivalent to saying that $\lim_{n \to \infty} g(n)/f(n) = 0$.


$\omega(f(n))$ A function $g(n)$ is in $\omega(f(n))$ (“little omega of $f(n)$”) if for every
$c > 0$ there exists an $N$ such that $|g(n)| \ge c|f(n)|$ for all $n > N$. This
is equivalent to saying that $\lim_{n \to \infty} |g(n)|/|f(n)|$ diverges to infinity.


7.2 Motivating the definitions 


Why would we use this notation? 


• Constant factors vary from one machine to another. The $c$ factor hides
this. If we can show that an algorithm runs in $O(n^2)$ time, we can be
confident that it will continue to run in $O(n^2)$ time no matter how fast
(or how slow) our computers get in the future.


• For the $N$ threshold, there are several excuses:


— Any problem can theoretically be made to run in O(1) time for 
any finite subset of the possible inputs (e.g. all inputs expressible 
in 50 MB or less), by prefacing the main part of the algorithm with 
a very large table lookup. So it’s meaningless to talk about the 
relative performance of different algorithms for bounded inputs. 


— If f(n) > 0 for all n, then we can get rid of N (or set it to zero) by 
making $c$ large enough. But some functions $f(n)$ take on zero (or
undefined) values for interesting $n$ (e.g., $f(n) = n^2$ is zero when
$n$ is zero, and $f(n) = \log n$ is undefined for $n = 0$ and zero for
$n = 1$). Allowing the minimum $N$ lets us write $O(n^2)$ or $O(\log n)$
for classes of functions that we would otherwise have to write
more awkwardly as something like $O(n^2 + 1)$ or $O(\log(n + 2))$.


— Putting the n > N rule in has a natural connection with the 
definition of a limit, where the limit as $n$ goes to infinity of $g(n)$
is defined to be $x$ if for each $\epsilon > 0$ there is an $N$ such that
$|g(n) - x| < \epsilon$ for all $n > N$. Among other things, this permits
the limit test that says $g(n) = O(f(n))$ if $\lim_{n \to \infty} g(n)/f(n)$ exists
and is finite.


7.3 Proving asymptotic bounds 


Most of the time when we use asymptotic notation, we compute bounds using
stock theorems like $O(f(n)) + O(g(n)) = O(\max(f(n), g(n)))$ or $O(cf(n)) =
O(f(n))$. But sometimes we need to unravel the definitions to see whether
a given function fits in a given class, or to prove these utility theorems to 
begin with. So let’s do some examples of how this works. 


Theorem 7.3.1. The function n is in $O(n^3)$.


Proof. We must find c, N such that for all $n > N$, $|n| \le c|n^3|$. Since $n^3$
is much bigger than n for most values of n, we’ll pick c to be something
convenient to work with, like 1. So now we need to choose N so that for all
$n > N$, $|n| \le |n^3|$. It is not the case that $|n| \le |n^3|$ for all n (try plotting
n vs $n^3$ for $n < 1$) but if we let $N = 1$, then we have $n > 1$, and we just
need to massage this into $n^3 \ge n$. There are a couple of ways to do this,
but the quickest is probably to observe that squaring and multiplying by n
(a positive quantity) are both increasing functions, which means that from
$n > 1$ we can derive $n^2 > 1^2 = 1$ and then $n^2 \cdot n = n^3 > 1 \cdot n = n$.


Theorem 7.3.2. The function $n^3$ is not in $O(n)$.


Proof. Here we need to negate the definition of $O(n)$, a process that turns
all existential quantifiers into universal quantifiers and vice versa. So what
we need to show is that for all $c > 0$ and N, there exists some $n > N$ for
which $|n^3|$ is not at most $c|n|$. So fix some such $c > 0$ and N. We must find
an $n > N$ for which $n^3 > cn$. Solving for n in this inequality gives $n > c^{1/2}$;
so setting $n > \max(N, c^{1/2})$ finishes the proof.


Theorem 7.3.3. If $f_1(n)$ is in $O(g(n))$ and $f_2(n)$ is in $O(g(n))$, then
$f_1(n) + f_2(n)$ is in $O(g(n))$.


Proof. Since $f_1(n)$ is in $O(g(n))$, there exist constants $c_1$, $N_1$ such that for
all $n > N_1$, $|f_1(n)| \le c_1|g(n)|$. Similarly there exist $c_2$, $N_2$ such that for all
$n > N_2$, $|f_2(n)| \le c_2|g(n)|$.

To show $f_1(n) + f_2(n)$ in $O(g(n))$, we must find constants c and N such
that for all $n > N$, $|f_1(n) + f_2(n)| \le c|g(n)|$. Let’s let $c = c_1 + c_2$. Then
if n is greater than $\max(N_1, N_2)$, it is greater than both $N_1$ and $N_2$, so we
can add together $|f_1| \le c_1|g|$ and $|f_2| \le c_2|g|$ to get
$|f_1 + f_2| \le |f_1| + |f_2| \le (c_1 + c_2)|g| = c|g|$.
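The definitions unravelled in these proofs can be spot-checked numerically. The sketch below is only a sanity check, not a proof: it tests a proposed pair (c, N) on a finite range, and for Theorem 7.3.2 it builds the violating n described in the proof. The function names and the test ranges are assumptions made for illustration.

```python
import math

def holds_on_range(f, g, c, N, upto):
    """Check |f(n)| <= c*|g(n)| for every integer n with N < n <= upto.

    A finite check like this can refute a proposed witness (c, N),
    but it can never prove that f is in O(g).
    """
    return all(abs(f(n)) <= c * abs(g(n)) for n in range(N + 1, upto + 1))

def counterexample(c, N):
    """Return an integer n > N with n**3 > c*n, following Theorem 7.3.2.

    The proof asks for n > max(N, c**(1/2)); we take the next integer.
    """
    return math.floor(max(N, math.sqrt(c))) + 1

# Theorem 7.3.1: the witness c = 1, N = 1 survives the check for n in O(n^3).
print(holds_on_range(lambda n: n, lambda n: n**3, c=1, N=1, upto=1000))

# Theorem 7.3.2: for any c > 0 and N we can find an n that breaks n^3 <= c*n.
n = counterexample(c=50, N=3)
print(n, n**3 > 50 * n)
```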


7.4 General principles for dealing with asymptotic 
notation 


7.4.1 Remember the difference between big-O, big-Ω, and big-Θ


• Use big-O when you have an upper bound on a function, e.g. the zoo
never got more than O(1) new gorillas per year, so there were at most
O(t) gorillas at the zoo in year t.


• Use big-Ω when you have a lower bound on a function, e.g. every year
the zoo got at least one new gorilla, so there were at least Ω(t) gorillas
at the zoo in year t.

• Use big-Θ when you know the function exactly to within a constant-factor
error, e.g. every year the zoo got exactly five new gorillas, so
there were Θ(t) gorillas at the zoo in year t.


For the others, use little-o and ω when one function becomes vanishingly
small relative to the other, e.g. new gorillas arrived rarely and with declining
frequency, so there were o(t) gorillas at the zoo in year t. These are not used
as much as big-O, big-Ω, and big-Θ in the algorithms literature.


7.4.2 Simplify your asymptotic terms as much as possible 


• $O(f(n)) + O(g(n)) = O(f(n))$ when $g(n) = O(f(n))$. If you have an
expression of the form $O(f(n) + g(n))$, you can almost always rewrite
it as $O(f(n))$ or $O(g(n))$ depending on which is bigger. The same goes
for Ω and Θ.


• $O(cf(n)) = O(f(n))$ if c is a constant. You should never have a
constant inside a big O. This includes bases for logarithms: since
$\log_a x = \log_b x / \log_b a$, you can always rewrite $O(\lg n)$, $O(\ln n)$, or
$O(\log_{1.4467712} n)$ as just $O(\log n)$.


• But watch out for exponents and products: $O(3^n n^{3.1178} \log^{1/3} n)$ is
already as simple as it can be.


7.4.3 Use limits (may require calculus) 


If you are confused whether e.g. $\log n$ is $O(n)$, try computing the limit as n
goes to infinity of $\frac{\log n}{n}$, and see if it converges to a constant (zero is OK).
The general rule is that f(n) is O(g(n)) if $\lim_{n \to \infty} \frac{f(n)}{g(n)}$ exists.¹

You may need to use L’Hôpital’s Rule to evaluate such limits if they
aren’t obvious. This says that

$$\lim_{n \to \infty} \frac{f(n)}{g(n)} = \lim_{n \to \infty} \frac{f'(n)}{g'(n)}$$

when f(n) and g(n) both diverge to infinity or both converge to zero. Here
$f'$ and $g'$ are the derivatives of f and g with respect to n; see §H.2.
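A quick numerical look at the ratio can suggest what the limit is before doing any calculus. The sketch below tabulates f(n)/g(n) at powers of ten; the helper name `ratios` and the choice of sample points are assumptions, and shrinking values are evidence rather than proof.

```python
import math

def ratios(f, g, exponents=range(1, 7)):
    """Return f(n)/g(n) for n = 10, 100, ..., as rough evidence about the limit."""
    return [f(10**k) / g(10**k) for k in exponents]

# log n versus n: the ratios shrink toward 0, consistent with log n = O(n).
print(ratios(math.log, lambda n: n))

# L'Hopital's rule predicts the same limit for the ratio of derivatives, (1/n)/1.
print(ratios(lambda n: 1 / n, lambda n: 1))
```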


¹ Note that this is a sufficient but not necessary condition. For example, the function
f(n) that is 1 when n is even and 2 when n is odd is O(1), but $\lim_{n \to \infty} \frac{f(n)}{1}$ doesn’t exist.


7.5 Asymptotic notation and summations


Algorithms often involve loops, where the cost of the loop is the sum of the 
costs of each iteration. When we are looking for an asymptotic cost, we 
don’t need to compute an exact value for this sum, but can instead use an 
approximation that is accurate up to constant factors. This can make our 
life much easier. 

Here’s my usual strategy for computing sums in asymptotic form: 


7.5.1 Pull out constant factors 


Pull as many constant factors out as you can (where constant in this case
means anything that does not involve the summation index). Example:
$\sum_{i=1}^{n} \frac{n}{i} = n \sum_{i=1}^{n} \frac{1}{i} = n H_n = \Theta(n \log n)$. (See harmonic series below.)


7.5.2 Bound using a known sum 


See if it’s bounded above or below by some other sum whose solution you 
already know. Here are some good sums to try (some of these previously 
appeared in §6.4). 


7.5.2.1 Geometric series 


$$\sum_{i=0}^{n} x^i = \frac{1 - x^{n+1}}{1 - x} = \frac{x^{n+1} - 1}{x - 1}$$

and

$$\sum_{i=0}^{\infty} x^i = \frac{1}{1 - x}.$$


The way to recognize a geometric series is that the ratio between adjacent 
terms is constant. If you memorize the second formula, you can rederive the 
first one. If you’re Gauss, you can skip memorizing the second formula. 

A useful fact about geometric series is that if x is a constant that is not
exactly 1, the sum is always big-Theta of its largest term. So for example
$\sum_{i=0}^{n} 2^i = \Theta(2^n)$ (the exact value is $2^{n+1} - 1$), and $\sum_{i=1}^{n} 2^{-i} = \Theta(1)$ (the
exact value is $1 - 2^{-n}$).

If the ratio between terms equals 1, the formula doesn’t work; instead, 
we have a constant series (see below). 
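The finite closed form is easy to check with exact rational arithmetic. The following sketch compares direct summation against the formula for a few ratios; the function names are illustrative, and `fractions.Fraction` is used only to avoid floating-point rounding.

```python
from fractions import Fraction

def geometric_sum(x, n):
    """Sum x**i for i = 0..n by direct addition."""
    return sum(x**i for i in range(n + 1))

def geometric_closed_form(x, n):
    """Closed form (1 - x**(n+1)) / (1 - x), valid when x != 1."""
    return (1 - x**(n + 1)) / (1 - x)

# The two agree exactly for ratios above 1, below 1, and negative.
for x in (Fraction(2), Fraction(1, 2), Fraction(-3, 4)):
    assert geometric_sum(x, 10) == geometric_closed_form(x, 10)

# With x = 2 the sum is within a factor of 2 of its largest term 2**n.
print(geometric_sum(2, 10), 2**10)
```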




7.5.2.2 Constant series

$$\sum_{i=1}^{n} 1 = n.$$


7.5.2.3 Arithmetic series


The simplest arithmetic series is

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}.$$

The way to remember this formula is that it’s just n times the average value
(n+1)/2. The way to recognize an arithmetic series is that the difference
between adjacent terms is constant. The general arithmetic series is of the
form

$$\sum_{i=1}^{n} (ai + b) = \sum_{i=1}^{n} ai + \sum_{i=1}^{n} b = \frac{an(n+1)}{2} + bn.$$

Because the general series expands so easily to the simple series, it’s usually
not worth memorizing the general formula.
In asymptotic terms, every arithmetic series is $\Theta(n^2)$.


7.5.2.4 Harmonic series

$$\sum_{i=1}^{n} 1/i = H_n = \Theta(\log n).$$

This can be rederived using the integral technique given below or by summing
the last half of the series, so this is mostly useful to remember in case you
run across $H_n$ (the “n-th harmonic number”).
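As a sanity check on the growth rate, the sketch below compares $H_n$ with the bounds $\ln(n+1) \le H_n \le 1 + \ln n$ that the integral technique yields for the non-increasing function 1/x. The function name `harmonic` and the sample values of n are assumptions for illustration.

```python
import math

def harmonic(n):
    """Return H_n = 1/1 + 1/2 + ... + 1/n as a float."""
    return sum(1 / i for i in range(1, n + 1))

# Each row shows: n, lower bound ln(n + 1), H_n, upper bound 1 + ln n.
for n in (1, 10, 1000, 100000):
    print(n, math.log(n + 1), harmonic(n), 1 + math.log(n))
```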

The infinite sum $\sum_{i=1}^{\infty} 1/i$ diverges: even though the partial sums grow very
slowly, adding enough terms will eventually exceed any constant bound.

The value of the more general infinite sum $\sum_{i=1}^{\infty} 1/i^s$ is called $\zeta(s)$, and
$\zeta$ is called the Riemann zeta function. The harmonic series is the case
where s = 1, and because it diverges, $\zeta(1)$ is undefined. However, $\zeta(s)$ is
defined for s > 1. The exact value can be hard to compute,² but as long as
s does not depend on n, it’s O(1) for any fixed s > 1.


² Finding just the value of $\zeta(2) = \sum_i 1/i^2 = \pi^2/6$ was known as the Basel problem


7.5.3 Bound part of the sum


See if there’s some part of the sum that you can bound. For example, $\sum_{i=1}^{n} i^3$
has a (painful) exact solution, or can be approximated by the integral trick
described below, but it can very quickly be solved to within a constant factor
by observing that $\sum_{i=1}^{n} i^3 \le \sum_{i=1}^{n} n^3 = O(n^4)$ and
$\sum_{i=1}^{n} i^3 \ge \sum_{i=n/2}^{n} i^3 \ge \sum_{i=n/2}^{n} (n/2)^3 = \Omega(n^4)$.
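The two bounds above can be checked directly for small n. In this sketch the lower bound is taken as $n^4/16$, which comes from keeping roughly n/2 terms each at least $(n/2)^3$; that constant and the function name `cube_sum` are choices made here for illustration.

```python
def cube_sum(n):
    """Return 1**3 + 2**3 + ... + n**3 exactly."""
    return sum(i**3 for i in range(1, n + 1))

for n in (2, 10, 100, 1000):
    s = cube_sum(n)
    upper = n * n**3      # n terms, each at most n^3
    lower = n**4 / 16     # about n/2 terms, each at least (n/2)^3
    assert lower <= s <= upper
    print(n, lower, s, upper)
```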


7.5.4 Integrate 


Integrate. If f(n) is non-decreasing and you know how to integrate it, then
$\int_{a-1}^{b} f(x)\,dx \le \sum_{i=a}^{b} f(i) \le \int_{a}^{b+1} f(x)\,dx$, which is enough to get a big-Theta
bound for almost all functions you are likely to encounter in algorithm
analysis. If you don’t know how to integrate, see §H.3.
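Here is a small sketch of the integral bounds for the non-decreasing function $f(x) = x^3$, whose antiderivative $x^4/4$ is supplied by hand. The helper `integral_bounds` is a hypothetical name, and the range 1..100 is an arbitrary example.

```python
def integral_bounds(F, a, b):
    """Bounds on sum(f(i) for i in a..b) for non-decreasing f with antiderivative F.

    Returns (integral from a-1 to b, integral from a to b+1).
    """
    return F(b) - F(a - 1), F(b + 1) - F(a)

def f(x):
    return x**3

def F(x):
    return x**4 / 4       # antiderivative of x^3

a, b = 1, 100
low, high = integral_bounds(F, a, b)
total = sum(f(i) for i in range(a, b + 1))
print(low, total, high)   # the exact sum sits between the two integrals
```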


7.5.5 Grouping terms 


Try grouping terms together. For example, the standard trick for showing
that the harmonic series is unbounded in the limit is to argue that
$1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7 + 1/8 + \ldots \ge 1 + 1/2 + (1/4 + 1/4) + (1/8 + 1/8 + 1/8 + 1/8) + \ldots \ge 1 + 1/2 + 1/2 + 1/2 + \ldots$.
I usually try everything
else first, but sometimes this works if you get stuck.

Warning: Though it’s always safe to reorder terms in a finite sum, bad 
things can happen if you reorder an infinite sum that includes both positive 
and negative terms. 


7.5.6 An odd sum 


One oddball sum that shows up occasionally but is hard to solve using
any of the above techniques is $\sum_{i=1}^{n} a^i i$. If a < 1, this is $\Theta(1)$ (the exact
formula for $\sum_{i=1}^{\infty} a^i i$ when a < 1 is $a/(1 - a)^2$, which gives a constant upper
bound for the sum stopping at n); if a = 1, it’s just an arithmetic series; if
a > 1, the largest term dominates and the sum is $\Theta(a^n n)$ (there is an exact
formula, but it’s ugly; if you just want to show it’s $O(a^n n)$, the simplest
approach is to bound the series $\sum_{i=0}^{n-1} a^{n-i}(n - i)$ by the geometric series
$\sum_{i=0}^{n-1} a^{n-i} n \le a^n n/(1 - a^{-1}) = O(a^n n)$). I wouldn’t bother memorizing this
one provided you remember how to find it in these notes.
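Both regimes of this sum can be illustrated numerically. The sketch below uses exact fractions for a = 1/2 and integers for a = 3; the function name `odd_sum` and the sample values are assumptions, and the printed ratios only illustrate the bounds stated above.

```python
from fractions import Fraction

def odd_sum(a, n):
    """Return the sum of a**i * i for i = 1..n."""
    return sum(a**i * i for i in range(1, n + 1))

# a < 1: partial sums stay below the infinite-sum value a / (1 - a)**2.
a = Fraction(1, 2)
limit = a / (1 - a)**2
print([float(odd_sum(a, n)) for n in (1, 5, 20)], float(limit))

# a > 1: the ratio to the largest term a**n * n stays below 1 / (1 - 1/a).
a = 3
print([odd_sum(a, n) / (a**n * n) for n in (1, 5, 20)], 1 / (1 - 1 / a))
```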


and took 90 years to solve. When s can be a complex number, showing that $\zeta(s) = 0$ only
if s is a negative even integer or of the form $1/2 + ai$ is the Riemann hypothesis and
has not yet been proven or disproven since it was first proposed in 1859. You will not be
asked to solve either of these problems in this class.


7.5.7 Final notes


In practice, almost every sum you are likely to encounter in algorithm
analysis will be of the form $\sum_{i=1}^{n} f(i)$ where f(n) is exponential (so that
it’s bounded by a geometric series and the largest term dominates) or
polynomial (so that $f(n/2) = \Theta(f(n))$ and the sum is $\Theta(n f(n))$ using the
$\sum_{i=n/2}^{n} f(i) = \Omega(n f(n))$ lower bound).

Graham et al. [GKP94] spend a lot of time on computing sums exactly.
The most generally useful technique for doing this is to use generating 
functions (see §11.3). 


7.6 Variations in notation 


As with many tools in mathematics, you may see some differences in how 
asymptotic notation is defined and used. 


7.6.1 Absolute values 


Some authors leave out the absolute values. For example, Biggs [Big02]
defines f(n) as being in O(g(n)) if $f(n) \le cg(n)$ for sufficiently large n. If
f(n) and g(n) are non-negative, this is not an unreasonable definition. But it
produces odd results if either can be negative: for example, by this definition,
$-n^{1000}$ is in $O(n^2)$. Some authors define O, Ω, and Θ only for non-negative
functions, avoiding this problem.

The most common definition (which we will use) says that f(n) is in
O(g(n)) if $|f(n)| \le c|g(n)|$ for sufficiently large n; by this definition $-n^{1000}$ is
not in $O(n^2)$, though it is in $O(n^{1000})$. This definition was designed for error
terms in asymptotic expansions of functions, where the error term might
represent a positive or negative error.


7.6.2 Abusing the equals sign 


Formally, we can think of O(g(n)) as a predicate on functions, which is true of
all functions f(n) that satisfy $f(n) \le cg(n)$ for some c and sufficiently large
n. This requires writing that $n^2$ is $O(n^2)$ where most computer scientists
or mathematicians would just write $n^2 = O(n^2)$. Making sense of the latter
statement involves a standard convention that is mildly painful to define
formally but that greatly simplifies asymptotic analyses.

Let’s take a statement like the following:

$$O(n^2) + O(n^3) + 1 = O(n^3).$$

What we want this to mean is that the left-hand side can be replaced
by the right-hand side without causing trouble. To make this work formally,
we define the statement as meaning that for any f in $O(n^2)$ and any g in
$O(n^3)$, there exists an h in $O(n^3)$ such that $f(n) + g(n) + 1 = h(n)$.

In general, any appearance of O, Ω, or Θ on the left-hand side gets
a universal quantifier (for all) and any appearance of O, Ω, or Θ on the
right-hand side gets an existential quantifier (there exists). So

$$f(n) + o(f(n)) = \Theta(f(n))$$

means that for any g in $o(f(n))$, there exists an h in $\Theta(f(n))$ such that
$f(n) + g(n) = h(n)$, and

$$O(f(n)) + O(g(n)) + 1 = O(\max(f(n), g(n))) + 1$$

means that for any r in $O(f(n))$ and s in $O(g(n))$, there exists t in $O(\max(f(n), g(n)))$
such that $r(n) + s(n) + 1 = t(n) + 1$.

The nice thing about this definition is that as long as you are careful about
the direction the equals sign goes in, you can treat these complicated pseudo-equations
like ordinary equations. For example, since $O(n^2) + O(n^3) = O(n^3)$,
we can write

$$\frac{n^2}{2} + \frac{n(n+1)(n+2)}{6} = O(n^2) + O(n^3) = O(n^3),$$


which is much simpler than what it would look like if we had to talk about 
particular functions being elements of particular sets of functions. 

This is an example of abuse of notation, the practice of redefining 
some standard bit of notation (in this case, equations) to make calculation 
easier. It’s generally a safe practice as long as everybody understands what 
is happening. But beware of applying facts about unabused equations to the 
abused ones. Just because $O(n^2) = O(n^3)$ doesn’t mean $O(n^3) = O(n^2)$: the
big-O equations are not reversible the way ordinary equations are.

More discussion of this can be found in [Fer08, §10.4] and [GKP94,
Chapter 9].


Chapter 8 


Number theory 


Number theory is the study of the natural numbers, particularly their 
divisibility properties. Nowadays this often involves bringing in the integers 
as well, since being able to subtract can be handy. But the ultimate goal is 
to understand the naturals. 

If you read about number theory elsewhere, you may find a mismatch
between our definition of $\mathbb{N}$ (which includes 0) and the definition favored
by number theorists (which doesn’t). Number theorists like to leave 0 out
because otherwise many theorems about numbers would require annoying
“except 0” clauses. We will write $\mathbb{N}^+$ for the positive natural numbers
$\{n \in \mathbb{N} \mid n > 0\}$, which excludes 0. This is the same set as the positive
integers $\mathbb{Z}^+$, which are $\{x \in \mathbb{Z} \mid x > 0\}$.¹

The natural numbers $\mathbb{N}$ have commutative and associative addition and
multiplication operations, where each has an identity and multiplication
distributes over addition.² This makes them a commutative semiring. If
we had additive inverses as well,³ as in $\mathbb{Z}$, we would get a commutative ring.
One way to remember that $\mathbb{N}$ is a semiring while $\mathbb{Z}$ is a ring is that $\mathbb{N}$ is
what you get after chopping off half of $\mathbb{Z}$.


¹ In general, $\mathbb{N}^+$, $\mathbb{Z}^+$, $\mathbb{Q}^+$, and $\mathbb{R}^+$ all refer to the positive elements of each set, while
$\mathbb{Z}^-$, $\mathbb{Q}^-$, and $\mathbb{R}^-$ are the negative elements. None of these sets includes 0, which is neither
positive nor negative. The set $\mathbb{N}^-$ of negative natural numbers is technically well-defined,
but since it is empty it doesn’t come up much.

² Formally, $\mathbb{N}$ satisfies Axioms 4.1.1, 4.1.2, 4.1.3, 4.1.7, 4.1.8, 4.1.9, and 4.1.11.

³ Axiom 4.1.5.


8.1 Divisibility


Except for the identity elements 0 and 1, no natural number has an additive
or multiplicative inverse. No multiplicative inverses means that we can’t, in
general, divide a natural number n by another natural number m: given n
and $m \ne 0$, there is no guarantee that we can write n as qm for some q in $\mathbb{N}$.
If there is such a q, then n is divisible by m, although we usually write this
in the reversed direction by saying that m divides n, written as $m \mid n$.

If m | n, m is said to be a factor or divisor of n. A number greater 
than 1 whose only factors are 1 and itself is called prime. Non-primes that 
are greater than 1 are called composite. The remaining natural numbers 0 
and 1 are by convention neither prime nor composite; this allows us to avoid 
writing “except 0 or 1” in a lot of places later. 

We can use the same definition of divisibility for integers, by letting m
divide n if there is an integer k such that km = n. This gives $m \mid n$ if and
only if |m| divides |n|. This does have some odd consequences, like −7 being
prime. The integer −1 gets the same special exemption as 1: both are units,
numbers that, because they divide the identity, are considered neither prime
nor composite.

Some useful facts about divisibility: 


• If $d \mid m$ and $d \mid n$, then $d \mid (m + n)$. Proof: Let $m = ad$ and $n = bd$,
then $(m + n) = (a + b)d$.


• If $d \mid n$ and $n \ne 0$, then $d \le n$. Proof: $n = kd \ne 0$ implies $k \ge 1$
implies $n = kd \ge d$.


• For all d, $d \mid 0$. Proof: $0 \cdot d = 0$.


• If $d \mid m$ or $d \mid n$, then $d \mid mn$. Proof: Suppose $m = kd$, then $mn = (nk)d$.
Alternatively, if $n = kd$, then $mn = (mk)d$.


• If p is prime, then $p \mid ab$ if and only if $p \mid a$ or $p \mid b$. Proof: Surprisingly
difficult. We’ll get this as a consequence of the extended Euclidean
algorithm in §8.4.2.
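The facts in this list are easy to test by brute force on small natural numbers. The sketch below is a finite check, not a proof; `divides` and `is_prime` are hypothetical helper names, and `math.isqrt` assumes Python 3.8 or later.

```python
import math

def divides(m, n):
    """Return True if m | n, that is, if n = k*m for some integer k."""
    if m == 0:
        return n == 0          # k*0 is always 0, so 0 divides only 0
    return n % m == 0

def is_prime(n):
    """Trial division: n > 1 with no factor d in 2..floor(sqrt(n))."""
    return n > 1 and not any(divides(d, n) for d in range(2, math.isqrt(n) + 1))

R = range(0, 25)
for d in range(1, 25):
    for m in R:
        for n in R:
            if divides(d, m) and divides(d, n):
                assert divides(d, m + n)
            if divides(d, m) or divides(d, n):
                assert divides(d, m * n)
            if is_prime(d):
                # For primes the converse of the previous fact also holds.
                assert divides(d, m * n) == (divides(d, m) or divides(d, n))

print([p for p in R if is_prime(p)])
```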


8.2 The division algorithm 


If m does not divide n, then any attempt to divide n things into m equal
piles will leave some things left over. In this case, we can use an extended
version of division that expresses n as $qm + r$, where q is the quotient of n
and m and r is a remainder satisfying $0 \le r < m$. The fact that we can do
this is a consequence of the division algorithm, due to Euclid.

The division algorithm yields for any pair of integers (which might or
might not be natural numbers) n and $m \ne 0$ a unique integer quotient q and
remainder r such that $n = qm + r$ and $0 \le r < |m|$.

For positive m, the quotient is often written as $\lfloor n/m \rfloor$, the floor of n/m,
to make it clear that we want the integer version of the quotient and not
some nasty fraction; for negative m, we’ll get the ceiling $\lceil n/m \rceil$ instead.
The remainder is often written as (n mod m), pronounced “the remainder of
n modulo m” when paid by the word but usually just “n mod m.” Saying
that n mod m = 0 is the same as saying that m divides n ($m \mid n$ for short).

For non-negative n and positive m, we can find q and r recursively. If n is already
less than m, we can set $q = 0$ and $r = n$, while for larger n, we can compute
$n - m = q'm + r$ recursively and then set $q = q' + 1$. Showing that this
algorithm works is an application of strong induction.⁴


Theorem 8.2.1 (Division algorithm). Let n, m be integers with $m \ne 0$. Then
there exist unique integers q and r such that $0 \le r < |m|$ and $n = qm + r$.


Proof. First we show that q and r exist for $n \ge 0$ and $m > 0$. This is done
by induction on n. If $n < m$, then $q = 0$ and $r = n$ satisfies $n = qm + r$ and
$0 \le r < m$. If $n \ge m$, then $n - m \ge 0$ and $n - m < n$, so from the induction
hypothesis there exist some $q'$, $r$ such that $n - m = q'm + r$ and $0 \le r < m$.
Then if $q = q' + 1$, we have $n = (n - m) + m = q'm + r + m = (q' + 1)m + r = qm + r$.

Next we extend to the cases where n might be negative. If $n < 0$ and
$m > 0$, then there exist $q'$, $r'$ with $0 \le r' < m$ such that $-n = q'm + r'$. If
$r' = 0$, let $q = -q'$ and $r = 0$, giving $n = -(-n) = -(q'm + r') = qm + r$.
If $r' \ne 0$, let $q = -q' - 1$ and $r = m - r'$; now $n = -(-n) = -(q'm + r') =
-(-(q + 1)m + (m - r)) = -(-qm - r) = qm + r$. So in either case appropriate
q and r exist.

Finally, we consider the case where m is negative. Let $n = q'(-m) + r$,
where $0 \le r < -m$. Let $q = -q'$. Then $n = q'(-m) + r = (-q') \cdot m + r = qm + r$.

So far we have only shown that q and r exist; we haven’t shown that
they are unique. For uniqueness, suppose that $n = qm + r = q'm + r'$, where
$0 \le r \le r' < |m|$. Then $(q'm + r') - (qm + r) = 0$, which we can rearrange
to get $r' - r = (q - q')m$. In particular, $m \mid (r' - r)$, so there exists some k
such that $r' - r = k \cdot |m|$. If $k = 0$, then $r' = r$, from which we can conclude
$q' = q$. If $k \ne 0$, $k \ge 1$, so $r' \ge r' - r \ge |m|$, contradicting the requirement
that $r' < |m|$.


⁴ Repeated subtraction is not a very good algorithm for division, but it’s what Euclid
used, and since we don’t care about efficiency we’ll stick with it.


Note that quotients of negative numbers always round down. For example,
$\lfloor (-3)/17 \rfloor = -1$ even though −3 is much closer to 0 than it is to −17. This
is so that the remainder is always non-negative (14 in this case). This may
or may not be consistent with the behavior of the remainder operator in
your favorite programming language.
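As one concrete data point, the sketch below implements the convention of Theorem 8.2.1 in Python and compares it with the built-in `divmod`. Python’s `divmod` floors the quotient, so it agrees with the theorem for positive m but gives a remainder with the sign of the divisor when m is negative. The names `divide` and `divide_by_subtraction` are made up for this illustration.

```python
def divide(n, m):
    """Return (q, r) with n == q*m + r and 0 <= r < abs(m), as in Theorem 8.2.1."""
    if m == 0:
        raise ZeroDivisionError("m must be nonzero")
    q, r = divmod(n, abs(m))      # floor division by |m| gives 0 <= r < |m|
    return (q if m > 0 else -q), r

def divide_by_subtraction(n, m):
    """Recursive repeated subtraction from the text, for n >= 0 and m > 0."""
    if n < m:
        return 0, n
    q, r = divide_by_subtraction(n - m, m)
    return q + 1, r

print(divide(-3, 17))              # the example from the text
print(divide(7, -2))               # remainder stays non-negative for negative m
print(divmod(7, -2))               # the built-in lets the remainder go negative
print(divide_by_subtraction(402, 176))
```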


8.3 Modular arithmetic and residue classes 


From the division algorithm, we have that for each pair of integers n and
$m \ne 0$, there is a unique remainder r with $0 \le r < |m|$ and $n = qm + r$ for
some q; this unique r is written as (n mod m). Define $n \equiv_m n'$ (read “n is
congruent to $n'$ mod m”) if $(n \bmod m) = (n' \bmod m)$, or equivalently if
there is some $q \in \mathbb{Z}$ such that $n = n' + qm$.

The set of integers congruent to n mod m is called the residue class
of n (residue is an old word for remainder), and is written as $[n]_m$. The
sets $[0]_m, [1]_m, \ldots, [m-1]_m$ between them partition the integers, and the set
$\{[0]_m, [1]_m, \ldots, [m-1]_m\}$ defines the integers mod m, written $\mathbb{Z}_m$. We will
see that $\mathbb{Z}_m$ acts very much like $\mathbb{Z}$, with well-defined operations for addition,
subtraction, and multiplication, making it a commutative ring. In the
case where the modulus is prime, we even get division: $\mathbb{Z}_p$ is a finite field
for any prime p.

The most well-known instance of $\mathbb{Z}_m$ is $\mathbb{Z}_2$, the integers mod 2. The class
$[0]_2$ is the even numbers and the class $[1]_2$ is the odd numbers.


8.3.1 Arithmetic on residue classes 


We can define arithmetic operations on residue classes in $\mathbb{Z}_m$ just as we defined
arithmetic operations on integers (defined as equivalence classes of pairs of
naturals). Given residue classes $[x]_m$ and $[y]_m$, define $[x]_m + [y]_m = [x + y]_m$,
where the addition in the right-hand side is the usual integer addition in
$\mathbb{Z}$. Here x and y act as representatives for their respective residue classes.
It’s unusual to write out the brackets; instead, when working in $\mathbb{Z}_m$ we just
use representatives, or write something like $2 + 3 = 1 \pmod{4}$ if we want to
emphasize that we are taking remainders after each operation.

For example, in $\mathbb{Z}_2$ we have 0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, and 1 + 1 = 0
(since $1 + 1 = 2 \in [0]_2$). For this to make sense, we must verify that this
definition of addition is well-defined: in particular, it shouldn’t matter which
representatives x and y we pick.
To prove this, let’s start with an alternative characterization of when
$x \equiv_m y$:


Lemma 8.3.1. Let $x, y \in \mathbb{Z}$ and let $m \in \mathbb{N}^+$. Then $x \equiv_m y$ if and only if
$m \mid (x - y)$.


Proof. Write $x = qm + r$, $y = sm + t$, where $0 \le r, t < m$. Then
$x - y = (q - s)m + (r - t)$. If $m \mid (x - y)$, then $m \mid (r - t)$; since $-m < r - t < m$
this implies $r - t = 0$ and thus $x \bmod m = r = t = y \bmod m$. Conversely,
suppose $x \bmod m = y \bmod m$. Then $r - t = 0$ giving $x - y = (q - s)m$, so
$m \mid (x - y)$.


Theorem 8.3.2. If $x \equiv_m x'$ and $y \equiv_m y'$, then $x + y \equiv_m x' + y'$.


Proof. From Lemma 8.3.1, $m \mid (x - x')$ and $m \mid (y - y')$. So
$m \mid ((x - x') + (y - y'))$, which we can rearrange as $m \mid ((x + y) - (x' + y'))$. Apply
Lemma 8.3.1 in the other direction to get $x + y \equiv_m x' + y'$.


Similarly, we can define $-[x]_m = [-x]_m$ and $[x]_m \cdot [y]_m = [x \cdot y]_m$. The
same approach as in the proof of Theorem 8.3.2 shows that these definitions
also give well-defined operations on residue classes.⁵

All of the usual properties of addition, subtraction, and multiplication
are inherited from $\mathbb{Z}$: addition and multiplication are commutative and
associative, the distributive law applies, etc. This makes $\mathbb{Z}_m$ a commutative
ring just like $\mathbb{Z}$.
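The claim that these operations do not depend on the choice of representatives can be spot-checked by brute force. The sketch below uses the characterization from Lemma 8.3.1 as its test for congruence; the modulus 7, the ranges, and the name `congruent` are arbitrary choices for illustration.

```python
def congruent(x, y, m):
    """Return True if x and y are congruent mod m, i.e. m divides x - y."""
    return (x - y) % m == 0

m = 7
for x in range(-10, 11):
    for y in range(-10, 11):
        # Lemma 8.3.1: same remainder if and only if m | (x - y).
        assert congruent(x, y, m) == (x % m == y % m)
        # Pick different representatives of the same two residue classes.
        x2, y2 = x + 3 * m, y - 5 * m
        assert congruent(x + y, x2 + y2, m)   # addition is well-defined
        assert congruent(-x, -x2, m)          # negation is well-defined
        assert congruent(x * y, x2 * y2, m)   # multiplication is well-defined
print("all representative checks passed for m =", m)
```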

To give a concrete example, Table 8.1 gives tables for the addition,
multiplication, and negation operators in $\mathbb{Z}_5$.

Using these tables, we can do arbitrarily horrible calculations in $\mathbb{Z}_5$ using
the same rules as in $\mathbb{Z}$, e.g., $2 \cdot (1 + 3) - 4 = 2 \cdot 4 - 4 = 3 - 4 = 3 + (-4) = 3 + 1 = 4 \pmod{5}$.
We put the “(mod 5)” at the end so that the reader won’t think
we’ve gone nuts.

⁵ For $[-x]_m$: Suppose $x \equiv_m x'$; then $m \mid (x - x')$, which implies $m \mid (-(x - x'))$ or
$m \mid ((-x) - (-x'))$, giving $-x \equiv_m -x'$.

For $[x]_m \cdot [y]_m$: Suppose $x \equiv_m x'$. Then $m \mid (x - x')$ implies $m \mid ((x - x')y)$ implies
$m \mid (xy - x'y)$. So $xy \equiv_m x'y$. Applying the same argument shows that if $y \equiv_m y'$,
$xy \equiv_m xy'$. Transitivity can then be used to show $xy \equiv_m x'y \equiv_m x'y'$.


 + | 0 1 2 3 4     × | 0 1 2 3 4     x | −x
 0 | 0 1 2 3 4     0 | 0 0 0 0 0     0 |  0
 1 | 1 2 3 4 0     1 | 0 1 2 3 4     1 |  4
 2 | 2 3 4 0 1     2 | 0 2 4 1 3     2 |  3
 3 | 3 4 0 1 2     3 | 0 3 1 4 2     3 |  2
 4 | 4 0 1 2 3     4 | 0 4 3 2 1     4 |  1


Table 8.1: Arithmetic in $\mathbb{Z}_5$


The fact that $[x]_m + [y]_m = [x + y]_m$ and $[x]_m \times [y]_m = [xy]_m$ for all x and
y means that the remainder operation $x \mapsto x \bmod m$ is a homomorphism
from $\mathbb{Z}$ to $\mathbb{Z}_m$: it preserves the operations + and × on $\mathbb{Z}$. The formal
definition of a homomorphism that preserves an operation (say +) is a
function f such that $f(a + b) = f(a) + f(b)$. The function x mod m preserves
not only the operations + and × but the constants 0 and 1. This means
that it doesn’t matter when you perform mod operations when converting a
complicated expression in $\mathbb{Z}$ to the corresponding expression in $\mathbb{Z}_m$. You can
do one big mod at the end, or you can do mods in the middle whenever it’s
convenient, and you will get the same answer either way.
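The tables for $\mathbb{Z}_5$ and the “reduce early or reduce late” property can both be reproduced in a few lines. This is a minimal sketch; the helper `table` and the sample values of x, y, z are assumptions chosen for illustration.

```python
def table(op, m):
    """Return the m-by-m table of op(x, y) mod m for x, y in 0..m-1."""
    return [[op(x, y) % m for y in range(m)] for x in range(m)]

m = 5
add = table(lambda x, y: x + y, m)
mul = table(lambda x, y: x * y, m)
neg = [(-x) % m for x in range(m)]
for row in mul:
    print(row)

# Reducing at the end or in the middle gives the same element of Z_m.
x, y, z = 123456789, 987654321, 555
late = (x * y + z) % m
early = ((x % m) * (y % m) + (z % m)) % m
assert late == early

# The worked example from the text: 2*(1+3) - 4 = 4 (mod 5).
assert (2 * (1 + 3) - 4) % m == 4
```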


8.4 Greatest common divisors 


Let m and n be numbers, where at least one of m and n is nonzero, and
let k be the largest number for which $k \mid m$ and $k \mid n$. Then k is called
the greatest common divisor or gcd of m and n, written gcd(m, n) or
sometimes just (m, n). A similar concept is the least common multiple
(lcm), written lcm(m, n), which is the smallest k such that $m \mid k$ and $n \mid k$.

Formally, $g = \gcd(m, n)$ if $g \mid m$, $g \mid n$, and for any $g'$ that divides both
m and n, $g' \mid g$. Similarly, $\ell = \operatorname{lcm}(m, n)$ if $m \mid \ell$, $n \mid \ell$, and for any $\ell'$ with
$m \mid \ell'$ and $n \mid \ell'$, $\ell \mid \ell'$.

Two numbers m and n whose gcd is 1 are said to be relatively prime 
or coprime, or simply to have no common factors. 

The divisibility relation is an example of a partial order (§9.5): a set
with a relation $\le$ that is transitive, reflexive, and antisymmetric, but where
there might be some x and y such that neither $x \le y$ nor $y \le x$ holds. If
divisibility is considered as a partial order, the naturals form a lattice (see
§9.5.3), which is a partial order in which every pair of elements x and y has
both a unique greatest element that is less than or equal to both (the meet
$x \wedge y$, equal to gcd(x, y) in this case) and a unique smallest element that is
greater than or equal to both (the join $x \vee y$, equal to lcm(x, y) in this case).


8.4.1 The Euclidean algorithm for computing gcd(m, n)


Euclid described in Book VII of his Elements what is now known as the 
Euclidean algorithm for computing the gcd of two numbers (his original 
version was for finding the largest square you could use to tile a given 
rectangle, but the idea is the same). Euclid’s algorithm is based on the 
recurrence 

$$\gcd(m, n) = \begin{cases} n & \text{if } m = 0, \\ \gcd(n \bmod m, m) & \text{if } m > 0, \end{cases}$$

which holds whenever at least one of m and n is positive.

The first case holds because $n \mid 0$ for all n. The second holds because if k
divides both n and m, then k divides $n \bmod m = n - \lfloor n/m \rfloor m$; and conversely
if k divides m and $n \bmod m$, then k divides $n = (n \bmod m) + m \lfloor n/m \rfloor$. So
(m, n) and (n mod m, m) have the same set of common factors, and the
greatest of these is the same.

So the algorithm simply takes the remainder of the larger number by the 
smaller recursively until it gets a zero, and returns whatever number is left. 
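The recurrence translates almost directly into code. The following is a minimal iterative sketch for non-negative arguments; it names its function `gcd` for readability, which is a local choice here and separate from the standard library’s `math.gcd`.

```python
def gcd(m, n):
    """Euclid's recurrence: gcd(0, n) = n and gcd(m, n) = gcd(n mod m, m) for m > 0."""
    while m > 0:
        m, n = n % m, m
    return n

print(gcd(176, 402))
print(gcd(402, 176))   # if m > n, the first step just swaps the arguments
```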


8.4.2 The extended Euclidean algorithm 


The extended Euclidean algorithm not only computes gcd(m, n), but
also computes integer coefficients $m'$ and $n'$ such that

$$m'm + n'n = \gcd(m, n).$$

This turns out to have several useful consequences, including the existence
of inverses for any $a \in \mathbb{Z}_m$ with $\gcd(a, m) = 1$ and the fact that when
p is prime, $p \mid ab$ if and only if $p \mid a$ or $p \mid b$.

It has the same structure as the Euclidean algorithm, but keeps track of 
more information in the recurrence. Specifically: 


• For m = 0, gcd(m, n) = n with $n' = 1$ and $m' = 0$.


• For m > 0, let $n = qm + r$ where $0 \le r < m$, and use the algorithm
recursively to compute a and b such that $ar + bm = \gcd(r, m) = \gcd(m, n)$.
Substituting $r = n - qm$ gives $\gcd(m, n) = a(n - qm) + bm = (b - aq)m + an$.
This gives both the gcd and the coefficients $m' = b - aq$
and $n' = a$.

Finding gcd(176,402)
q = 2 r = 50
 Finding gcd(50,176)
 q = 3 r = 26
  Finding gcd(26,50)
  q = 1 r = 24
   Finding gcd(24,26)
   q = 1 r = 2
    Finding gcd(2,24)
    q = 12 r = 0
     Finding gcd(0,2)
     base case
     Returning 0*0 + 1*2 = 2
    a = b1 - a1*q = 1 - 0*12 = 1
    Returning 1*2 + 0*24 = 2
   a = b1 - a1*q = 0 - 1*1 = -1
   Returning -1*24 + 1*26 = 2
  a = b1 - a1*q = 1 - -1*1 = 2
  Returning 2*26 + -1*50 = 2
 a = b1 - a1*q = -1 - 2*3 = -7
 Returning -7*50 + 2*176 = 2
a = b1 - a1*q = 2 - -7*2 = 16
Returning 16*176 + -7*402 = 2


2 


Figure 8.1: Trace of extended Euclidean algorithm 


8.4.2.1 Example 


Figure 8.1 gives a computation of the gcd of 176 and 402, together with the 
extra coefficients. The code used to generate this figure is given in Figure 8.2. 


8.4.2.2 Applications 


• If $\gcd(n, m) = 1$, then there is a number $n'$ such that $nn' + mm' = 1$,
which means $nn' = 1 \pmod{m}$. This number $n'$ is called the
multiplicative inverse of n mod m and acts much like 1/n when
doing modular arithmetic (see §8.6.1).


• If p is prime and $p \mid ab$, then either $p \mid a$ or $p \mid b$. Proof: suppose $p \nmid a$;
since p is prime we have $\gcd(p, a) = 1$. So there exist r and s such
that $rp + sa = 1$. Multiply both sides by b to get $rpb + sab = b$. Then
$p \mid rpb$ and $p \mid sab$ (the latter because $p \mid ab$), so p divides their sum
and thus $p \mid b$. This is a key tool for proving the Fundamental Theorem
of Arithmetic (§8.5), and also shows that $\mathbb{Z}_p$ has no zero divisors,
nonzero numbers a and b such that $ab = 0$.⁶


    #!/usr/bin/python3

    def euclid(m, n, trace = False, depth = 0):
        """Implementation of extended Euclidean algorithm.

        Returns triple (a, b, g) where am + bn = g and g = gcd(m, n).
        Optional argument trace, if true, shows progress."""

        def output(s):
            # Indent each trace line by the current recursion depth.
            if trace:
                print("{}{}".format(' ' * depth, s))

        output("Finding gcd({},{})".format(m, n))

        if m == 0:
            # Base case: gcd(0, n) = n = 0*0 + 1*n.
            output("base case")
            a, b, g = 0, 1, n
        else:
            q = n//m
            r = n % m
            output("q = {} r = {}".format(q, r))

            # The recursive call gives a1*r + b1*m = g; substitute r = n - q*m.
            a1, b1, g = euclid(r, m, trace, depth + 1)
            a = b1 - a1*q
            b = a1

            output("a = b1 - a1*q = {} - {}*{} = {}".format(b1, a1, q, a))

        output("Returning {}*{} + {}*{} = {}".format(a, m, b, n, g))
        return a, b, g

    if __name__ == '__main__':
        import sys

        euclid(int(sys.argv[1]), int(sys.argv[2]), True)


Figure 8.2: Python code for extended Euclidean algorithm
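The first application, computing a multiplicative inverse, can be sketched on top of a compact version of the same recursion. The names `extended_gcd` and `inverse_mod` are hypothetical, and the code assumes a non-negative n and a positive modulus m; it is an illustration rather than a hardened implementation.

```python
def extended_gcd(m, n):
    """Return (a, b, g) with a*m + b*n == g == gcd(m, n), for m, n >= 0."""
    if m == 0:
        return 0, 1, n
    a1, b1, g = extended_gcd(n % m, m)
    # a1*(n mod m) + b1*m = g; substitute n mod m = n - (n // m)*m.
    return b1 - a1 * (n // m), a1, g

def inverse_mod(n, m):
    """Return n' in 0..m-1 with n*n' = 1 (mod m); requires gcd(n, m) == 1."""
    a, _, g = extended_gcd(n % m, m)
    if g != 1:
        raise ValueError("{} has no inverse mod {}".format(n, m))
    return a % m

print(extended_gcd(176, 402))   # same coefficients as the trace in Figure 8.1
print(inverse_mod(3, 7))
```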


8.5 The Fundamental Theorem of Arithmetic 


Let n be a number greater than 0. Then there is a unique sequence of
primes $p_1 \le p_2 \le \ldots \le p_k$ such that $n = p_1 p_2 \ldots p_k$. This fact is known as
the Fundamental Theorem of Arithmetic, and the sequence $p_1 \ldots p_k$ is
called the prime factorization of n.

Showing that there is at least one such sequence is an easy induction
argument. If n = 1, take the empty sequence; by convention, the product
of the empty sequence is the multiplicative identity, 1. If n is prime, take
$p_1 = n$; otherwise, let n = ab where a and b are both greater than 1 (and
hence both less than n, so the induction hypothesis applies to each). Then
$n = p_1 \ldots p_k q_1 \ldots q_m$ where the $p_i$ are the prime factors of a and the $q_i$
are the prime factors of b. Unfortunately, this simple argument does not
guarantee uniqueness of the sequence: it may be that there is some n with
two or more distinct prime factorizations.

We can show that the prime factorization is unique by an induction
argument that uses the fact that p | ab implies p | a or p | b, which we
proved using the extended Euclidean algorithm in §8.4.2.2. If n = 1, then
any non-empty sequence of primes has a product greater than 1; it follows
that the empty sequence is the unique factorization of 1. If n is prime, any
factorization other than n alone would show that it isn’t; this provides a
base case of n = 2 and n = 3 as well as covering larger values of n that are
prime. So suppose that n is composite, and that $n = p_1 \cdots p_k = q_1 \cdots q_m$,
where $\{p_i\}$ and $\{q_j\}$ are nondecreasing sequences of primes. Suppose also (by
the induction hypothesis) that any n′ < n has a unique prime factorization.

If $p_1 = q_1$, then $p_2 \cdots p_k = q_2 \cdots q_m$, and so the two sequences are identical
by the induction hypothesis. Alternatively, suppose that $p_1 < q_1$; note that
this also implies $p_1 < q_i$ for all i, so that $p_1$ doesn’t appear anywhere in
the second factorization of n. A prime that divides a product of primes must
equal one of them (apply p | ab implies p | a or p | b repeatedly), so $p_1$
doesn’t divide $q_1 \cdots q_m = n$, a contradiction. The case $q_1 < p_1$ is symmetric.
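
The existence half of the argument is constructive, so it can be turned into a program. The following is a minimal sketch using trial division, which is an assumption of this example rather than anything the proof requires; the function name `prime_factorization` is made up for illustration. It returns the nondecreasing sequence of primes, with the empty sequence for n = 1.

```python
def prime_factorization(n):
    """Return the nondecreasing list of primes whose product is n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:   # peel off every copy of p before moving on
            factors.append(p)
            n //= p
        p += 1
    if n > 1:               # what is left has no divisor <= sqrt(n), so it is prime
        factors.append(n)
    return factors


print(prime_factorization(24))   # the primes 2, 2, 2, 3
print(prime_factorization(1))    # the empty sequence
```

Trial division is only practical for small n; the point here is just that the output is the same no matter how we go about finding the factors, which is what uniqueness promises.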


⁶This is not true in $\mathbb{Z}_m$ when m is not prime, since we can just pick a and b so that
ab = m.

8.5.1 Unique factorization and gcd


Using unique factorization, we can compute gcd(a,b) by factoring both a
and b and retaining all the common factors, which is the algorithm favored
in elementary school when dealing with small numbers. Without unique
factorization, this wouldn’t work: we might get unlucky and factor a or b
the wrong way so that the common factors didn’t line up. For very large
numbers, computing prime factorizations becomes impractical, so Euclid’s
algorithm is a better choice.

Similarly, for every a and b, we can compute the least common multiple
lcm(a,b) by taking the maximum of the exponents on each prime that
appears in the factorization of a or b. (It can also be found by computing
lcm(a,b) = ab/gcd(a,b), which is more efficient for large a and b because we
don’t have to factor.)

One way to look at this is that we can represent any n in $\mathbb{N}^+$ as a
sequence of exponents, where the i-th element in the sequence is the exponent
on the i-th prime. So, for example, $24 = 2^3 \cdot 3^1$ would be represented by
the sequence (3,1,0,0,0,...) and $675 = 3^3 \cdot 5^2$ would be represented by
(0,3,2,0,0,0,...). Taking the gcd of two numbers corresponds to taking
the componentwise min of the corresponding sequences, while taking the
lcm corresponds to the componentwise max. This has gcd(24,675) represented
by $(0,1,0,0,0,0,\dots) = 3^1 = 3$ and lcm(24,675) represented by
$(3,3,2,0,0,0,\dots) = 2^3 \cdot 3^3 \cdot 5^2 = 5400$.
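
The min/max description above can be checked directly. This is a sketch only: the helper names (`factor_exponents`, `from_exponents`, and so on) are invented for the example, the exponent sequence is stored as a prime-to-exponent map instead of an infinite tuple, and factoring is done by trial division, so it is suitable for small numbers only.

```python
from collections import Counter
from math import gcd


def factor_exponents(n):
    """Map each prime dividing n to its exponent (trial division)."""
    exps = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            exps[p] += 1
            n //= p
        p += 1
    if n > 1:
        exps[n] += 1
    return exps


def from_exponents(exps):
    """Multiply out a prime -> exponent map."""
    result = 1
    for p, e in exps.items():
        result *= p ** e
    return result


def gcd_by_factoring(a, b):
    fa, fb = factor_exponents(a), factor_exponents(b)
    # componentwise min; primes missing from either side have exponent 0
    return from_exponents({p: min(fa[p], fb[p]) for p in fa.keys() & fb.keys()})


def lcm_by_factoring(a, b):
    fa, fb = factor_exponents(a), factor_exponents(b)
    # componentwise max over every prime that appears in a or b
    return from_exponents({p: max(fa[p], fb[p]) for p in fa.keys() | fb.keys()})


def lcm_by_gcd(a, b):
    """The factoring-free alternative: lcm(a, b) = ab / gcd(a, b)."""
    return a * b // gcd(a, b)


print(gcd_by_factoring(24, 675), lcm_by_factoring(24, 675), lcm_by_gcd(24, 675))
```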


8.6 More modular arithmetic 


Here we will look a little more closely at the structure of $\mathbb{Z}_m$, the integers
mod m.


8.6.1 Division in $\mathbb{Z}_m$


One thing we don’t get in general in $\mathbb{Z}_m$ is the ability to divide. This is not
terribly surprising, since we don’t get to divide (without remainders) in
$\mathbb{Z}$ either. But for some values of x and m we can in fact do division: for
these x and m there exists a multiplicative inverse $x^{-1}$ (mod m) such
that $x x^{-1} \equiv 1 \pmod{m}$. We can see the winning x’s for $\mathbb{Z}_9$ by looking for
ones in the multiplication table for $\mathbb{Z}_9$, given in Table 8.2.

Here we see that $1^{-1} = 1$, as we’d expect, but that we also have $2^{-1} = 5$,
$4^{-1} = 7$, $5^{-1} = 2$, $7^{-1} = 4$, and $8^{-1} = 8$. There are no inverses for 0, 3, or 6.

    ×   0  1  2  3  4  5  6  7  8
    0   0  0  0  0  0  0  0  0  0
    1   0  1  2  3  4  5  6  7  8
    2   0  2  4  6  8  1  3  5  7
    3   0  3  6  0  3  6  0  3  6
    4   0  4  8  3  7  2  6  1  5
    5   0  5  1  6  2  7  3  8  4
    6   0  6  3  0  6  3  0  6  3
    7   0  7  5  3  1  8  6  4  2
    8   0  8  7  6  5  4  3  2  1

Table 8.2: Multiplication table for $\mathbb{Z}_9$


What 1, 2, 4, 5, 7, and 8 have in common is that they are all relatively
prime to 9. This is not an accident: when gcd(x,m) = 1, we can use the
extended Euclidean algorithm (§8.4.2) to find $x^{-1}$ (mod m). Observe that
what we want is some x′ such that $xx' \equiv_m 1$, or equivalently such that
x′x + qm = 1 for some q. But the extended Euclidean algorithm finds such
an x′ (and q) whenever gcd(x,m) = 1.

If gcd(x,m) ≠ 1, then x has no multiplicative inverse in $\mathbb{Z}_m$. The reason
is that if some d > 1 divides both x and m, it continues to divide xx′ and
m for any x′ ≠ 0. So in particular xx′ can’t be congruent to 1 mod m since
qm + 1 and m don’t share any common factors for any value of q.
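
Both cases can be seen in a short program. The sketch below is one standard iterative form of the extended Euclidean algorithm; it is not necessarily the presentation used in §8.4.2, and the names `extended_gcd` and `mod_inverse` are chosen for this example. It returns the inverse when gcd(x,m) = 1 and `None` otherwise.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) and s*a + t*b == g."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t


def mod_inverse(x, m):
    """Return x^{-1} mod m, or None when gcd(x, m) != 1."""
    g, s, _ = extended_gcd(x % m, m)
    if g != 1:
        return None          # x shares a factor with m, so no inverse exists
    return s % m             # s*x + t*m == 1, so s is the inverse of x mod m


# The inverses in Z_9; 0, 3, and 6 have none.
inverses_mod_9 = {x: mod_inverse(x, 9) for x in range(9)}
print(inverses_mod_9)
```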

The set of residue classes $[x]_m$ where gcd(x,m) = 1 is written as $\mathbb{Z}^*_m$.
For a prime p, $\mathbb{Z}^*_p$ includes all non-zero elements of $\mathbb{Z}_p$, since gcd(x,p) = 1 for
any x that is not 0 or a multiple of p. This means that $\mathbb{Z}_p$ satisfies the same
field axioms (§4.1) as $\mathbb{Q}$ or $\mathbb{R}$: in $\mathbb{Z}_p$, we can add, subtract, multiply, and
divide by any number that’s not 0, and all of these operations behave the way
we expect. However, $\mathbb{Z}_p$ is not an ordered field, since the fact that numbers
wrap around means that we can’t define a ≤ relation that is invariant with
respect to translation or scaling.

Unlike addition, subtraction, and multiplication, division in $\mathbb{Z}_p$ doesn’t
project down from a corresponding operation in $\mathbb{Z}$ or $\mathbb{Q}$. So while we can
write $\frac{3}{4} = 3 \cdot 4^{-1} = 3 \cdot 4 = 2 \pmod{5}$, if we compute $\frac{3}{4}$ first
(in $\mathbb{Q}$, say), there is no natural mapping from $\mathbb{Q}$ to $\mathbb{Z}_5$ that sends $\frac{3}{4}$ to 2.⁷


⁷While we could define a mapping $f(p/q) = (p \bmod 5)(q \bmod 5)^{-1}$ that would work for
many rationals, the problem is what to do with fractions like $\frac{3}{5}$, whose
denominator has no inverse mod 5.

8.6.2 The Chinese Remainder Theorem

In the form typically used today, the Chinese Remainder Theorem⁸
(CRT for short) looks like this:


Theorem 8.6.1 (Chinese Remainder Theorem). Let $m_1$ and $m_2$ be relatively
prime.⁹ Then for each pair of equations

$$n \bmod m_1 = n_1,$$
$$n \bmod m_2 = n_2,$$

there is a unique solution n with $0 \le n < m_1 m_2$.


We’ll defer the proof for a moment and give an example to show what
the theorem means. Suppose $m_1 = 3$ and $m_2 = 4$. Then the integers n from
0 to 11 can be represented as pairs $(n_1, n_2)$ with no repetitions as follows:

     n   n_1  n_2
     0    0    0
     1    1    1
     2    2    2
     3    0    3
     4    1    0
     5    2    1
     6    0    2
     7    1    3
     8    2    0
     9    0    1
    10    1    2
    11    2    3


⁸The earliest known written version of the theorem appeared in The Mathematical
Classic of Sunzi, a Chinese text of uncertain date but probably from around the fifth
century. For this reason the theorem is often called Sunzi’s Remainder Theorem in Chinese.

The name “Chinese Remainder Theorem” as used in English is much more recent,
having been popularized in 1929 by an American mathematician, who learned
about the result from a nineteenth-century translation of a version written in
1247 by the Chinese mathematician Qin Jiushao but didn’t see any need to mention
any particular person responsible for it. Theodore Kim gives some context
for this choice and its possible connections to early twentieth-century American
anti-Chinese racism in an op-ed at
https://www.washingtonpost.com/outlook/2021/12/08/racism-our-curriculums-isnt-limited-history-its-math-too/.

A more detailed discussion of the history of the theorem and its
naming can be found at
http://mathoverflow.net/questions/11951/what-is-the-history-of-the-name-chinese-remainder-theorem.

As is often the case with old mathematical results, the theorem was known for some time
before its first convincing proof. The sixth-century Indian mathematician Aryabhata gave
an algorithm for computing solutions (while taking a break from inventing trigonometry),
and the most general version appears to have been first proven by Qin Jiushao.

⁹This means that $\gcd(m_1, m_2) = 1$.

This gives a factorization of $\mathbb{Z}_{12}$ as $\mathbb{Z}_3 \times \mathbb{Z}_4$. This doesn’t just mean that
we can represent elements of $\mathbb{Z}_{12}$ as pairs of elements in $\mathbb{Z}_3 \times \mathbb{Z}_4$; since this
factorization is an isomorphism, we can do arithmetic on these pairs and
get the same answers as if we did the arithmetic in $\mathbb{Z}_{12}$. For example, the
element 7 of $\mathbb{Z}_{12}$ is represented by the pair (1,3) in $\mathbb{Z}_3 \times \mathbb{Z}_4$, and similarly 5
is represented by (2,1). So to multiply 7 · 5 in $\mathbb{Z}_{12}$, we can instead multiply
(1,3) by (2,1) componentwise in $\mathbb{Z}_3 \times \mathbb{Z}_4$. This gives (1 · 2, 3 · 1) = (2,3),
which from the above list we can see corresponds to 11 in $\mathbb{Z}_{12}$. This matches
the result we get from 7 · 5 = 35 = 11 (mod 12).


Proof. We’ll show an explicit algorithm for constructing the solution. The
first trick is to observe that if a | b, then (x mod b) mod a = x mod a.
The proof is that x mod b = x − qb for some q, so (x mod b) mod a =
(x mod a) − (qb mod a) = x mod a since any multiple of b is also a multiple
of a, giving qb mod a = 0.

Since $m_1$ and $m_2$ are relatively prime, the extended Euclidean algorithm
gives $m_1'$ and $m_2'$ such that $m_1' m_1 \equiv 1 \pmod{m_2}$ and $m_2' m_2 \equiv 1 \pmod{m_1}$.
Let $n = (n_1 m_2' m_2 + n_2 m_1' m_1) \bmod m_1 m_2$. Then

\begin{align*}
n \bmod m_1 &= ((n_1 m_2' m_2 + n_2 m_1' m_1) \bmod m_1 m_2) \bmod m_1 \\
&= (n_1 m_2' m_2 + n_2 m_1' m_1) \bmod m_1 \\
&= (n_1 \cdot 1 + n_2 m_1' \cdot 0) \bmod m_1 \\
&= n_1 \bmod m_1 \\
&= n_1.
\end{align*}

A nearly identical calculation shows $n \bmod m_2 = n_2$ as well.

The intuition is that $m_2' m_2$ acts like 1 mod $m_1$ and 0 mod $m_2$, and vice
versa for $m_1' m_1$. Having found these basic solutions for (1,0) and (0,1),
solutions for arbitrary $(n_1, n_2)$ are just a matter of adding up enough of each.

That the solution for each pair is unique can be shown by counting: there
are $m_1 m_2$ possible choices for both pairs $(n_1, n_2)$ and solutions n, so if some
pair has more than one solution, some other pair must have none.¹⁰ But we
have just given an algorithm for generating a solution for any pair.
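
The construction in the proof translates almost line for line into code. The sketch below assumes an iterative extended Euclidean algorithm (any version that returns Bézout coefficients would do); `crt_pair` is a name made up for this example, and `m1_inv` and `m2_inv` play the roles of $m_1'$ and $m_2'$.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) and s*a + t*b == g."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t


def crt_pair(n1, m1, n2, m2):
    """Return the unique n in [0, m1*m2) with n mod m1 == n1 and n mod m2 == n2."""
    g, s, t = extended_gcd(m1, m2)      # s*m1 + t*m2 == 1 when gcd is 1
    if g != 1:
        raise ValueError("moduli must be relatively prime")
    m1_inv = s % m2                     # m1' with m1' * m1 = 1 (mod m2)
    m2_inv = t % m1                     # m2' with m2' * m2 = 1 (mod m1)
    return (n1 * m2_inv * m2 + n2 * m1_inv * m1) % (m1 * m2)


# Multiply 7 * 5 in Z_12 by working componentwise in Z_3 x Z_4.
a, b = 7, 5
pair = ((a % 3) * (b % 3) % 3, (a % 4) * (b % 4) % 4)
print(pair, crt_pair(pair[0], 3, pair[1], 4), (a * b) % 12)
```

The last line repeats the 7 · 5 example: the componentwise product maps back to the same element of $\mathbb{Z}_{12}$ that direct multiplication gives.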


The general version allows for any number of equations, as long as the
moduli are pairwise relatively prime, which means that each pair of moduli
have a gcd of 1. The full result is:


Theorem 8.6.2 (Chinese Remainder Theorem (general version)). Let $m_1, \dots, m_k$
satisfy $\gcd(m_i, m_j) = 1$ for all pairs i, j with $i \ne j$. Then any system of
equations

$$n \equiv n_i \pmod{m_i}$$

has a unique solution n with $0 \le n < \prod_i m_i$.


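Proof. The solution can be computed using the formula

$$n = \left( \sum_i n_i \prod_{j \ne i} \left( m_j^{-1} \pmod{m_i} \right) m_j \right) \bmod \prod_i m_i.$$

As in the two-modulus case, the factor $(m_j^{-1} \pmod{m_i})\, m_j$, where $m_j^{-1} \pmod{m_i}$
is the multiplicative inverse of $m_j$ mod $m_i$, acts like 1 mod $m_i$ and
0 mod $m_j$. So for any fixed k,

\begin{align*}
n \bmod m_k &= \left( \left( \sum_i n_i \prod_{j \ne i} \left( m_j^{-1} \pmod{m_i} \right) m_j \right) \bmod \prod_i m_i \right) \bmod m_k \\
&= \left( \sum_i n_i \prod_{j \ne i} \left( m_j^{-1} \pmod{m_i} \right) m_j \right) \bmod m_k \\
&= \left( n_k \cdot 1 + \sum_{i \ne k} (n_i \cdot 0) \right) \bmod m_k \\
&= n_k.
\end{align*}

Uniqueness again follows by counting.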
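
As a sanity check on the formula, here is a direct and deliberately unoptimized transcription. The function names are assumptions of this sketch, and the inverse is computed with an extended Euclidean loop so that the block depends on nothing outside itself.

```python
def mod_inverse(x, m):
    """Inverse of x mod m via the extended Euclidean algorithm."""
    old_r, r = x % m, m
    old_s, s = 1, 0
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    if old_r != 1:
        raise ValueError("x is not invertible mod m")
    return old_s % m


def crt(residues, moduli):
    """Solve n = residues[i] (mod moduli[i]) for pairwise relatively prime moduli."""
    big_m = 1
    for m in moduli:
        big_m *= m
    total = 0
    for i, (n_i, m_i) in enumerate(zip(residues, moduli)):
        term = n_i
        for j, m_j in enumerate(moduli):
            if j != i:
                # (m_j^{-1} mod m_i) * m_j acts like 1 mod m_i and 0 mod m_j
                term *= mod_inverse(m_j, m_i) * m_j
        total += term
    return total % big_m


# Remainders 2, 3, 2 on division by 3, 5, 7.
print(crt([2, 3, 2], [3, 5, 7]))
```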


¹⁰This is an application of the Pigeonhole Principle (§11.1.3.2), which works for finite
sets.

The full CRT gives a solution to any collection of equations of the
appropriate form. If we take just the uniqueness part, it tells us that any
two such solutions are equivalent mod $m_1 m_2 \cdots m_k$. This can be expressed
as:


Corollary 8.6.3. Let $m_1, \dots, m_k$ be pairwise relatively prime, and for all
$i \in \{1, \dots, k\}$, let

$$x \equiv y \pmod{m_i}.$$

Then

$$x \equiv y \pmod{\prod_{i=1}^{k} m_i}.$$


8.6.3 The size of $\mathbb{Z}^*_m$ and Euler’s Theorem


Recall that $\mathbb{Z}^*_m$ is the set of numbers 0 < k < m such that gcd(m,k) = 1, or
equivalently the set of elements of $\mathbb{Z}_m$ that have multiplicative inverses.
The size of $\mathbb{Z}^*_m$ is written $\phi(m)$ and is called Euler’s totient function or
just the totient of m. When p is prime, gcd(n,p) = 1 for all n with 0 < n < p,
so $\phi(p) = |\mathbb{Z}^*_p| = p - 1$. For a prime power $p^k$, we similarly have $\gcd(n, p^k) = 1$
unless p | n. There are exactly $p^{k-1}$ numbers less than $p^k$ that are divisible
by p (they are $0, p, 2p, \dots, (p^{k-1} - 1)p$), so $\phi(p^k) = p^k - p^{k-1} = p^{k-1}(p - 1)$.¹¹
For composite numbers m that are not prime powers, finding the value of
$\phi(m)$ is more complicated; but we can show using the Chinese Remainder
Theorem (Theorem 8.6.1) that in general, for distinct primes $p_1, \dots, p_k$,

$$\phi\left( \prod_{i=1}^{k} p_i^{e_i} \right) = \prod_{i=1}^{k} p_i^{e_i - 1} (p_i - 1).$$

One reason $\phi(m)$ is important is that it plays a central role in Euler’s
Theorem:


Theorem 8.6.4. Let gcd(a,m) = 1. Then

$$a^{\phi(m)} \equiv 1 \pmod{m}.$$


Proof. We will prove this using an argument adapted from the proof of [Big02,
Theorem 13.3.2]. Let $z_1, z_2, \dots, z_{\phi(m)}$ be the elements of $\mathbb{Z}^*_m$. For any $y \in \mathbb{Z}^*_m$,
define $y\mathbb{Z}^*_m = \{y z_1, y z_2, \dots, y z_{\phi(m)}\}$. Since y has a multiplicative inverse
mod m, the mapping $z \mapsto yz \pmod{m}$ is a bijection, and so $y\mathbb{Z}^*_m = \mathbb{Z}^*_m$
(mod m). It follows that $\prod_i z_i = \prod_i y z_i = y^{\phi(m)} \prod_i z_i \pmod{m}$. But now
multiply both sides by $\left(\prod_i z_i\right)^{-1} = \prod_i z_i^{-1}$ to get $1 = y^{\phi(m)} \pmod{m}$ as
claimed.


¹¹Note that $\phi(p) = \phi(p^1) = p^{1-1}(p - 1) = p - 1$ is actually a special case of this.


For the special case that m is a prime, Euler’s Theorem is known as
Fermat’s Little Theorem, and says that $a^{p-1} \equiv 1 \pmod{p}$ for all primes
p and all a such that $p \nmid a$. Fermat proved this result before Euler generalized
it to composite m, which is why we have two names.
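
Both the totient formula and Euler’s Theorem are easy to test numerically for small m. In the sketch below, `totient_by_counting` follows the definition of $\phi(m)$ as a count, and `totient_by_formula` uses the prime-power product; both names are invented for this example, and trial division is assumed for the factoring step.

```python
from math import gcd


def totient_by_counting(m):
    """phi(m) straight from the definition: count k in [0, m) with gcd(m, k) = 1."""
    return sum(1 for k in range(m) if gcd(m, k) == 1)


def totient_by_formula(m):
    """phi(m) as the product of p^(e-1) * (p - 1) over prime powers p^e dividing m."""
    result = 1
    p = 2
    while p * p <= m:
        if m % p == 0:
            e = 0
            while m % p == 0:
                m //= p
                e += 1
            result *= p ** (e - 1) * (p - 1)
        p += 1
    if m > 1:                 # a single leftover prime factor with exponent 1
        result *= m - 1
    return result


# Euler's Theorem for m = 9: every a relatively prime to 9 has a^phi(9) = 1 (mod 9).
phi9 = totient_by_formula(9)
print(phi9, [pow(a, phi9, 9) for a in range(1, 9) if gcd(a, 9) == 1])
```

The three-argument form of Python’s built-in `pow` does the modular exponentiation.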


8.7 RSA encryption 


Euler’s Theorem is useful in cryptography. For example, the RSA
encryption system is based on the fact that $(x^e)^d \equiv x \pmod{m}$ when m
is the product of two distinct primes p and q, $de \equiv 1 \pmod{\phi(m)}$, and
$0 \le x < m$.¹² So x can be encrypted by raising it to the e-th power mod m,
and decrypted by raising the result to the d-th power. It is widely believed
that publishing e and m reveals no useful information about d provided e
and m are chosen carefully.

Specifically, the person who wants to receive secret messages chooses
large primes p and q, and finds d and e such that $de \equiv 1 \pmod{(p-1)(q-1)}$.
They then publish m = pq (the product, not the individual factors) and e.

Encrypting a message x involves computing $x^e \bmod m$. If x and e are
both large, computing $x^e$ and then taking the remainder is an expensive
operation; but it is possible to get the same value by computing $x^e$ in stages
by repeatedly squaring x and taking the product of the appropriate powers,
reducing mod m after each step.
To decrypt $x^e$, compute $(x^e)^d \bmod m$.
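
The repeated-squaring idea can be sketched as follows. This is one common way to organize it (scanning the bits of e from least significant to most significant); the name `power_mod` is chosen for this example, and Python’s built-in three-argument `pow` computes the same result.

```python
def power_mod(x, e, m):
    """Compute x**e mod m by repeated squaring, reducing mod m at every step."""
    result = 1 % m
    square = x % m            # holds x^(2^i) mod m on the i-th pass
    while e > 0:
        if e & 1:             # this power of two appears in the binary expansion of e
            result = (result * square) % m
        square = (square * square) % m
        e >>= 1
    return result


print(power_mod(3, 200, 1000), pow(3, 200, 1000))   # the two values agree
```

No intermediate value ever exceeds $m^2$, and the loop runs once per bit of e, so the cost grows with the number of digits of e instead of with e itself.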


¹²This is not quite immediate from Euler’s Theorem, because Euler’s Theorem only says
that $x^{\phi(m)} \equiv 1 \pmod{m}$ when gcd(x,m) = 1. But we can use the Chinese Remainder
Theorem to prove that $x^{de} = x^{k(p-1)(q-1)+1} \equiv x \pmod{m}$, where k is whatever
integer makes de = k(p − 1)(q − 1) + 1, holds even if gcd(x,pq) ≠ 1, as
long as p ≠ q and both are prime. The idea is that $\mathbb{Z}_{pq}$ factors as $\mathbb{Z}_p \times \mathbb{Z}_q$, so we
can represent $x \in \mathbb{Z}_{pq}$ as a pair $(x_p, x_q)$ where $x_p = x \bmod p$ and $x_q = x \bmod q$. Then
$x_p^{de} = x_p^{k(p-1)(q-1)} \cdot x_p = x_p \pmod{p}$, because either $x_p = 0 \pmod{p}$ and the product is
also 0; or $x_p \ne 0 \pmod{p}$ and Euler’s Theorem gives $x_p^{p-1} = 1 \pmod{p}$. Since the same
thing works on the q side, we get $(x_p^{de}, x_q^{de}) = (x_p, x_q)$, and thus $x^{de} = x$ by CRT.

For large p and q, that RSA works even with gcd(x,pq) ≠ 1 is a bit of a curiosity,
since for gcd(x,pq) not to be 1 then either p | x or q | x. Not only is this spectacularly
improbable, but if we do happen to find such an x, we break the encryption: by taking
gcd(x,m) we recover one of the factors of m, and now we can find both and compute d. So
most analyses of RSA just assume gcd(x,m) = 1 and use Euler’s Theorem directly with
modulus m.

For example, let p = 7, q = 13, so m = 91. The totient $\phi(m)$ of m is
(p − 1)(q − 1) = 6 · 12 = 72. Next pick some e relatively prime to $\phi(m)$:
e = 5. Since 5 · 29 = 72 · 2 + 1 we can make d = 29. Note that to compute d
in this way, we needed to know how to factor m so that we could compute
(p − 1)(q − 1); it’s not known how to find d otherwise.

Now let’s encrypt a message. Say we want to encrypt 11. Using e = 5
and m = 91, we can compute:

\begin{align*}
11^1 &= 11 \\
11^2 &= 121 = 30 \\
11^4 &= 30^2 = 900 = 81 \\
11^5 &= 11^4 \cdot 11^1 = 81 \cdot 11 = 891 = 72.
\end{align*}

When the recipient (who knows d) receives the encrypted message 72,
they can recover the original by computing $72^{29} \bmod 91$:

\begin{align*}
72^1 &= 72 \\
72^2 &= 5184 = 88 \\
72^4 &= 88^2 = (-3)^2 = 9 \\
72^8 &= 9^2 = 81 \\
72^{16} &= 81^2 = (-10)^2 = 100 = 9 \\
72^{29} &= 72^{16} \cdot 72^8 \cdot 72^4 \cdot 72^1 = 9 \cdot 81 \cdot 9 \cdot 72 = 81^2 \cdot 72 = 9 \cdot 72 = 648 = 11.
\end{align*}

Note that we are working in $\mathbb{Z}_{91}$ throughout. This is what saves us from
computing the actual value of $72^{29}$ in $\mathbb{Z}$,¹³ and only at the end taking the
remainder.
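
The whole toy example fits in a few lines. This sketch hard-codes the values p = 7, q = 13, and e = 5 from the text and finds d by brute-force search, which is only reasonable because $\phi(m)$ is tiny here; the function names `encrypt` and `decrypt` are assumptions of the example. It is in no sense a usable implementation of RSA.

```python
from math import gcd

p, q = 7, 13
m = p * q                     # public modulus, 91
phi = (p - 1) * (q - 1)       # 72; computing this requires knowing p and q
e = 5                         # public exponent, relatively prime to phi
assert gcd(e, phi) == 1

# Private exponent: the d in [1, phi) with d*e = 1 (mod phi).
d = next(k for k in range(1, phi) if (k * e) % phi == 1)


def encrypt(x):
    return pow(x, e, m)


def decrypt(y):
    return pow(y, d, m)


print(d, encrypt(11), decrypt(encrypt(11)))
```

Running the round trip over every x from 0 to 90 also exercises the messages with gcd(x,m) ≠ 1, which decrypt correctly as well.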

For actual security, we need m to be large enough that it’s hard to
recover p and q using presently conceivable factoring algorithms. Typical
applications choose m in the range of 2048 to 4096 bits, so that each of p
and q will be a random prime between roughly $10^{308}$ and $10^{617}$. This is too
big to show a hand-worked example, or even to fit into the much smaller
integer data types shipped by default in many programming languages, but
it’s not too large to be able to do the computations efficiently with a good
large-integer arithmetic library.


¹³If you’re curious, it’s 728857113063526668247098229876984590549890725463457792.


Chapter 9 


Relations 


A binary relation from a set A to a set B is a subset of A × B. In general,
an n-ary relation on sets $A_1, A_2, \dots, A_n$ is a subset of $A_1 \times A_2 \times \dots \times A_n$.
We will mostly be interested in binary relations, although n-ary relations
are important in databases. Unless otherwise specified, a relation will be a
binary relation. A relation from A to A is called a relation on A; many of
the interesting classes of relations we will consider are of this form. Some
simple examples are the relations =, <, ≤, and | (divides) on the integers.

You may recall that functions are a special case of relations, but most of 
the relations we will consider now will not be functions. 

Binary relations are often written in infix notation: instead of writing 
(x,y) € R, we write xRy. This should be pretty familiar for standard 
relations like < but might look a little odd at first for relations named with 
capital letters. 


9.1 Representing relations 


In addition to representing a relation by giving an explicit table ({(0,1), (0,2), (1,2)})
or rule (xRy if x < y, where x, y ∈ {0,1,2}), we can also visualize relations
in terms of other structures built on pairs of objects.


9.1.1 Directed graphs 


A directed graph consists of a set of vertices V and a set of edges E,
where each edge e has an initial vertex or source init(e) and a terminal
vertex or sink term(e). A simple directed graph has no parallel edges:
there are no distinct edges $e_1$ and $e_2$ with init($e_1$) = init($e_2$) and term($e_1$) = term($e_2$).


Figure 9.1: A directed graph


Figure 9.2: Relation {(1,2), (1,3), (2,3), (3,1)} represented as a directed
graph


If we don’t care about the labels of the edges, a simple directed graph 
can be described by giving E as a subset of V x V; this gives a one-to-one 
correspondence between relations on a set V and (simple) directed graphs. 
For relations from A to B, we get a bipartite directed graph, where all 
edges go from vertices in A to vertices in B. 

Directed graphs are drawn using a dot or circle for each vertex and an 
arrow for each edge, as in Figure 9.1. 

This also gives a way to draw relations. For example, the relation
on {1,2,3} given by {(1,2), (1,3), (2,3), (3,1)} can be depicted as shown in
Figure 9.2.

A directed graph that contains no sequence of edges leading back to 
its starting point is called a directed acyclic graph or DAG. DAGs are 
important for representing partially-ordered sets (see §9.5). 


9.1.2 Matrices 


A matrix is a two-dimensional analog of a sequence: in full generality, it is
a function A : S × T → U, where S and T are the index sets of the matrix
(typically {1...n} and {1...m} for some n and m). As with sequences, we
write $A_{ij}$ for A(i,j). Matrices are typically drawn inside square brackets like
this:

$$A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 2 & 1 & 0 & 0 \\ 1 & 0 & 0 & -1 \end{bmatrix}$$

The first index of an entry gives the row it appears in and the second one
the column, so in this example $A_{21} = 2$ and $A_{34} = -1$. The dimensions of
a matrix are the numbers of rows and columns; in the example, A is a 3 × 4
(pronounced “3 by 4”) matrix.

Note that rows come before columns in both indexing ($A_{ij}$: i is row, j
is column) and giving dimensions (n × m: n is rows, m is columns). Like
the convention of driving on the right (in many countries), this choice is
arbitrary, but failing to observe it may cause trouble.

Matrices are used heavily in linear algebra (Chapter 13), but for the
moment we will use them to represent relations from {1...n} to {1...m},
by setting $A_{ij} = 0$ if (i,j) is not in the relation and $A_{ij} = 1$ if (i,j) is. So
for example, the relation on {1...3} given by {(i,j) | i < j} would appear
in matrix form as

$$\begin{bmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$

When used to represent the edges in a directed graph, a matrix of this
form is called an adjacency matrix.


9.2 Operations on relations 


9.2.1 Composition 


Just like functions, relations can be composed: given relations R ⊆ A × B
and S ⊆ B × C we define (S ∘ R) ⊆ A × C by the rule (x,z) ∈ (S ∘ R) if and
only if there exists some y ∈ B such that (x,y) ∈ R and (y,z) ∈ S. (In infix
notation: x(S ∘ R)z ⟺ ∃y : xRy ∧ ySz.) It’s not hard to see that ordinary
function composition ((f ∘ g)(x) = f(g(x))) is just a special case of relation
composition.


In matrix terms, composition acts like matrix multiplication, where
we replace scalar multiplication with AND and scalar addition with OR:
$(S \circ R)_{ij} = \bigvee_k (R_{ik} \wedge S_{kj})$. Note that if we use the convention that $R_{ij} = 1$ if
iRj the order of the product is reversed from the order of composition.

Composition is associative: (R ∘ S) ∘ T = R ∘ (S ∘ T) for any relations
for which the composition makes sense. (This is easy but tedious to prove.)

For relations on a single set, we can iterate composition: $R^n$ is defined
by $R^0 = (=)$ and $R^{n+1} = R \circ R^n$. (This also works for functions, bearing
in mind that the equality relation is also the identity function.) In directed
graph terms, $x R^n y$ if and only if there is a path of exactly n edges from x to
y (possibly using the same edge more than once).
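
Here is a small sketch of composition in both forms, as sets of pairs and as 0/1 matrices. The function names are made up for this example, and relations are assumed to live on {1, ..., n} so that pairs can be used directly as matrix indices. Note the reversed order in the matrix product.

```python
def compose(S, R):
    """S o R = {(x, z) : there is some y with (x, y) in R and (y, z) in S}."""
    return {(x, z) for (x, y1) in R for (y2, z) in S if y1 == y2}


def relation_power(R, n, universe):
    """R^n, with R^0 the equality relation on the universe and R^(n+1) = R o R^n."""
    result = {(x, x) for x in universe}
    for _ in range(n):
        result = compose(R, result)
    return result


def to_matrix(rel, n):
    """0/1 matrix of a relation on {1, ..., n}; entry [i-1][j-1] is 1 when i rel j."""
    return [[1 if (i, j) in rel else 0 for j in range(1, n + 1)]
            for i in range(1, n + 1)]


def bool_product(A, B):
    """Matrix product with AND in place of multiplication and OR in place of addition."""
    return [[int(any(A[i][k] and B[k][j] for k in range(len(B))))
             for j in range(len(B[0]))]
            for i in range(len(A))]


R = {(1, 2), (1, 3), (2, 3), (3, 1)}      # the relation from Figure 9.2
S = {(2, 2), (3, 1)}                      # an arbitrary second relation
print(sorted(relation_power(R, 2, {1, 2, 3})))                                   # two-edge paths
print(to_matrix(compose(S, R), 3) == bool_product(to_matrix(R, 3), to_matrix(S, 3)))
```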


9.2.2 Inverses 


Relations also have inverses: $x R^{-1} y \Leftrightarrow y R x$. Unlike functions, every
relation has an inverse.


9.3 Classifying relations 


Certain properties of relations on a set are important enough to be given 
names that you should remember. 


Reflexive A relation R on a set A is reflexive if (a,a) is in R for all a in
A. The relations = and ≤ are both reflexive; < is not. The equality
relation is in a sense particularly reflexive: a relation R is reflexive if
and only if it is a superset of =.


Symmetric A relation R is symmetric if (a,b) is in R whenever (b,a) is.
Equality is symmetric, but ≤ is not. Another way to state symmetry
is that $R = R^{-1}$.


Antisymmetric A relation R is antisymmetric if the only way that both
(a,b) and (b,a) can be in R is if a = b. (More formally: aRb ∧ bRa →
a = b.) The “less than” relation < is antisymmetric: if a is less than b,
b is not less than a, so the premise of the definition is never satisfied.
The “less than or equal to” relation ≤ is also antisymmetric; here it
is possible for a ≤ b and b ≤ a to both hold, but only if a = b. The
set-theoretic statement is that R is antisymmetric if and only if
$R \cap R^{-1} \subseteq (=)$. This is probably not as useful as the simple definition.


Transitive A relation R is transitive if (a,b) in R and (b,c) in R implies
(a,c) in R. The relations =, <, and ≤ are all transitive. The relation
{(x, x + 1) | x ∈ ℕ} is not. The set-theoretic form is that R is transitive
if $R^2 \subseteq R$, or in general if $R^n \subseteq R$ for all n > 0.
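
For finite relations, each of these properties can be checked mechanically. The sketch below represents a relation as a Python set of pairs (an assumption of this example) and tests the four definitions on restrictions of =, ≤, <, and the successor relation to {0, 1, 2, 3}.

```python
def is_reflexive(R, A):
    return all((a, a) in R for a in A)


def is_symmetric(R):
    return all((b, a) in R for (a, b) in R)


def is_antisymmetric(R):
    # whenever both (a, b) and (b, a) are present, a must equal b
    return all(a == b for (a, b) in R if (b, a) in R)


def is_transitive(R):
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)


A = range(4)
eq = {(a, b) for a in A for b in A if a == b}
le = {(a, b) for a in A for b in A if a <= b}
lt = {(a, b) for a in A for b in A if a < b}
succ = {(a, a + 1) for a in range(3)}

for name, R in [("=", eq), ("<=", le), ("<", lt), ("succ", succ)]:
    print(name, is_reflexive(R, A), is_symmetric(R), is_antisymmetric(R), is_transitive(R))
```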
9.4 Equivalence relations


An equivalence relation is a relation that is reflexive, symmetric, and
transitive. Equality is the model of equivalence relations, but some other
examples are:


• Equality mod m: The relation $x \equiv y \pmod{m}$ that holds when x
and y have the same remainder when divided by m is an equivalence
relation. This is often written as $x \equiv_m y$.


• Equality after applying a function: Let f : A → B be any function,
and define $x \sim_f y$ if f(x) = f(y). Then $\sim_f$ is an equivalence relation.
Note that $\equiv_m$ is a special case of this.


• Membership in the same block of a partition: Let A be the union of
a collection of sets $A_i$ where the $A_i$ are all disjoint. The set $\{A_i\}$ is
called a partition of A, and each individual set $A_i$ is called a block
of the partition. Let x ∼ y if x and y appear in the same block $A_i$ for
some i. Then ∼ is an equivalence relation.


• Product equivalence relations: If $\sim_A$ is an equivalence relation on A,
and $\sim_B$ is an equivalence relation on B, then $\sim_{A \times B}$ is the equivalence
relation on A × B defined by $(a,b) \sim_{A \times B} (a',b')$ if and only if $a \sim_A a'$
and $b \sim_B b'$.


• Directed graph isomorphism: Suppose that G = (V,E) and G′ =
(V′,E′) are directed graphs, and there exists a bijection f : V → V′
such that (u,v) is in E if and only if (f(u), f(v)) is in E′. Then G
and G′ are said to be isomorphic (from Greek “same shape”). The
relation G ≅ G′ that holds when G and G′ are isomorphic is easily
seen to be reflexive (let f be the identity function), symmetric (replace
f by $f^{-1}$), and transitive (compose f : G → G′ and g : G′ → G″); thus it
is an equivalence relation.


• Partitioning a plane: draw a curve in a plane (i.e., pick a continuous
function $f : [0,1] \to \mathbb{R}^2$). Let x ∼ y if there is a curve from x to y
(i.e., a curve g with g(0) = x and g(1) = y) that doesn’t intersect the
first curve. Then x ∼ y is an equivalence relation on points in the
plane excluding the curve itself. Proof: To show x ∼ x, let g be the
constant function g(t) = x. To show x ∼ y ⟺ y ∼ x, consider some
function g demonstrating x ∼ y with g(0) = x and g(1) = y and let
g′(t) = g(1 − t). To show x ∼ y and y ∼ z implies x ∼ z, let g be a
curve from x to y and g′ a curve from y to z, and define a new curve
(g + g′) by (g + g′)(t) = g(2t) when t ≤ 1/2 and (g + g′)(t) = g′(2t − 1)
when t ≥ 1/2.


Any equivalence relation ∼ on a set A gives rise to a set of equivalence
classes, where the equivalence class of an element a is the set of all b such
that a ∼ b. Because of transitivity, the equivalence classes form a partition
of the set A, usually written A/∼ (pronounced “the quotient set of A by
∼,” “A slash ∼,” or sometimes “A modulo ∼”). A member of a particular
equivalence class is said to be a representative of that class. For example,
the equivalence classes of equality mod m are the sets $[i]_m = \{i + km \mid k \in \mathbb{N}\}$,
with one collection of representatives being {0, 1, 2, 3, ..., m − 1}. A more
complicated case is the plane partitioning example;
here the equivalence classes are essentially the pieces we get after cutting out
the curve f, and any point on a piece can act as a representative for that
piece.

This gives us several equally good ways of showing that a particular 
relation ~ is an equivalence relation: 


Theorem 9.4.1. Let ∼ be a relation on A. Then each of the following
conditions implies the others:


1. ∼ is reflexive, symmetric, and transitive.


2. There is a partition of A into disjoint equivalence classes $A_i$ such
that x ∼ y if and only if $x \in A_i$ and $y \in A_i$ for some i.


3. There is a set B and a function f : A → B such that x ∼ y if and only
if f(x) = f(y).

Proof. We do this in three steps:


• (1 → 2). For each x ∈ A, let $A_x = [x]_\sim = \{y \in A \mid y \sim x\}$, and let
the partition be $\{A_x \mid x \in A\}$. (Note that this may produce duplicate
indices for some sets.) By reflexivity, $x \in A_x$ for each x, so $A = \bigcup_x A_x$.


To show that distinct equivalence classes are disjoint, suppose that
$A_x \cap A_y \ne \emptyset$. Then there is some z that is in both $A_x$ and $A_y$, which
means that z ∼ x and z ∼ y; symmetry reverses these to get x ∼ z
and y ∼ z. If $q \in A_x$, then q ∼ x ∼ z ∼ y, giving $q \in A_y$; conversely, if
$q \in A_y$, then q ∼ y ∼ z ∼ x, giving $q \in A_x$. It follows that $A_x = A_y$.


• (2 → 3). Let $B = A/\!\sim\, = \{A_x\}$, where each $A_x$ is defined as above.
Let $f(x) = A_x$. Then x ∼ y implies $x \in A_y$ implies $A_x \cap A_y \ne \emptyset$. We’ve
shown above that if this is the case, $A_x = A_y$, giving f(x) = f(y).
Conversely, if f(x) ≠ f(y), then $A_x \ne A_y$, giving $A_x \cap A_y = \emptyset$. In
particular, $x \in A_x$ means $x \notin A_y$, so $x \not\sim y$.


• (3 → 1). Suppose x ∼ y if and only if f(x) = f(y) for some f. Then
f(x) = f(x), so x ∼ x: (∼) is reflexive. If x ∼ y, then f(x) = f(y),
giving f(y) = f(x) and thus y ∼ x: (∼) is symmetric. If x ∼ y ∼ z,
then f(x) = f(y) = f(z), and f(x) = f(z), giving x ∼ z: (∼) is
transitive.
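
On a finite set, the three descriptions in the theorem can be compared by computing the classes two ways. This is an illustrative sketch only: the function names are invented here, and `classes_from_relation` silently assumes that the predicate it is handed really is an equivalence relation.

```python
from collections import defaultdict


def classes_from_function(A, f):
    """Condition 3: group the elements of A by the value of f."""
    blocks = defaultdict(set)
    for x in A:
        blocks[f(x)].add(x)
    return {frozenset(block) for block in blocks.values()}


def classes_from_relation(A, related):
    """Condition 1 to condition 2: collect A_x = {y in A : y ~ x} for each x.

    Duplicate classes collapse because the result is a set of frozensets.
    """
    return {frozenset(y for y in A if related(y, x)) for x in A}


A = range(10)
m = 3
by_function = classes_from_function(A, lambda x: x % m)
by_relation = classes_from_relation(A, lambda x, y: (x - y) % m == 0)
print(sorted(sorted(block) for block in by_function))
print(by_function == by_relation)
```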


9.4.1 Why we like equivalence relations 


Equivalence relations are the way that mathematicians say “I don’t care.”
If you don’t care about which integer you’ve got except for its remainder
when divided by m, then you define two integers that don’t differ in any
way that you care about to be equivalent and work in $\mathbb{Z}/\!\equiv_m$. This turns
out to be incredibly useful for defining new kinds of things: for example, we
can define multisets (sets where elements can appear more than once) by
starting with sequences, declaring x ∼ y if there is a permutation of x that
reorders it into y, and then defining a multiset as an equivalence class with
respect to this relation.

This can also be used informally: “I’ve always thought that broccoli,
spinach, and kale are in the same equivalence class.”¹


9.5 Partial orders 


A partial order is a relation ≤ that is reflexive, transitive, and antisymmetric.
The last means that if x ≤ y and y ≤ x, then x = y. A set S together
with a partial order ≤ is called a partially ordered set or poset. A strict
partial order is a relation < that is irreflexive ($x \not< x$) and transitive. Any
partial order ≤ can be converted into a strict partial order < and vice versa
by deleting/including the pairs (x,x) for all x. This is equivalent to our
usual definition of x < y if and only if x ≤ y and x ≠ y.

A total order is a partial order ≤ in which any two elements are
comparable. This means that, given x and y, either x ≤ y or y ≤ x. A
poset (S, ≤) where ≤ is a total order is called totally ordered. Not all
partial orders are total orders; for an extreme example, the poset (S, =) for
any set S with two or more elements is partially ordered but not totally
ordered.


¹Curious fact: two of these unpopular vegetables are in fact cultivars of the same species,
Brassica oleracea, of cabbage.

Examples:


• (ℕ, ≤) is a poset. It is also totally ordered.


• (ℕ, ≥) is also both partially ordered and totally ordered. In general,
if R is a partial order, then $R^{-1}$ is also a partial order; similarly for
total orders. This property is known as duality and has the very nice
consequence that any concept we can define in terms of a partial order
≤ has a corresponding concept defined in terms of its inverse ≥.


• The divisibility relation a | b on natural numbers, where a | b if and
only if there is some k in ℕ such that b = ak, is reflexive (let k = 1),
antisymmetric (if a | b and b ≠ 0, then a ≤ b, so if a | b and b | a with
both nonzero then a ≤ b and b ≤ a, implying a = b; if either is 0, then
so is the other) and transitive (if b = ak and c = bk′ then
c = akk′). Thus it is a partial order.


• Let $(A, \le_A)$ and $(B, \le_B)$ be posets. Then the relation ≤ on A × B
defined by (a,b) ≤ (a′,b′) if and only if $a \le_A a'$ and $b \le_B b'$ is a partial
order. The poset (A × B, ≤) defined in this way is called the product
poset of A and B.


• Again let $(A, \le_A)$ and $(B, \le_B)$ be posets. The relation ≤ on A × B
defined by (a,b) ≤ (a′,b′) if either (1) $a <_A a'$ or (2) a = a′ and $b \le_B b'$
is called lexicographic order on A × B and is a partial order. The
useful property of lexicographic order (lex order for short) is that if
the original partial orders are total, so is the lex order: this is why
dictionary-makers use it. This also gives a source of very
difficult-to-visualize total orders, like lex order on ℝ × ℝ, which looks like the
classic real number line where every point is replaced by an entire copy
of the reals.


• Let Σ be some alphabet and consider the set $\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \dots$ of
all finite words drawn from Σ. Given two words x and y, let x ≤ y if x
is a prefix of y, i.e., if there is some word z such that xz = y. Then
(Σ*, ≤) is a poset.


• Using the same set Σ*, let x ⊑ y if x is a subsequence of y, i.e., if
there is a sequence of increasing positions $i_1 < i_2 < \dots < i_k$ such that
$x_j = y_{i_j}$. (For example, bd ⊑ abcde.) Then (Σ*, ⊑) is a poset.
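
Several of these orders are short enough to write down as predicates. The sketch below uses pairs of integers for the product and lex orders and Python strings for words; all four function names are assumptions made for the example. Python’s own tuple comparison happens to be lexicographic, which gives a convenient cross-check.

```python
def product_leq(p, q):
    """Product order on pairs: both coordinates must be <=."""
    return p[0] <= q[0] and p[1] <= q[1]


def lex_leq(p, q):
    """Lexicographic order on pairs: compare first coordinates, break ties on the second."""
    return p[0] < q[0] or (p[0] == q[0] and p[1] <= q[1])


def is_prefix(x, y):
    """x <= y in the prefix order: y starts with x."""
    return y[:len(x)] == x


def is_subsequence(x, y):
    """x is a subsequence of y: the characters of x appear in y in increasing positions."""
    remaining = iter(y)
    return all(ch in remaining for ch in x)   # each 'in' resumes where the last one stopped


# (1, 2) and (2, 1) are incomparable in the product order but comparable in lex order.
print(product_leq((1, 2), (2, 1)), product_leq((2, 1), (1, 2)), lex_leq((1, 2), (2, 1)))
print(is_prefix("ab", "abcde"), is_subsequence("bd", "abcde"), is_subsequence("db", "abcde"))
```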




There are also some common relations that are not partial orders or
strict partial orders but come close. For example, the element-of relation (∈)
is irreflexive and antisymmetric (this ultimately follows from the Axiom of
Foundation) but not transitive; if x ∈ y and y ∈ z we do not generally expect
x ∈ z. The “is at least as rich as” relation is reflexive and transitive but not
antisymmetric: if you and I have a net worth of 0, we are each as rich as the
other, and yet we are not the same person. Relations that are reflexive and
transitive (but not necessarily antisymmetric) are called quasi-orders or
pre-orders and can be turned into partial orders by defining an equivalence
relation x ∼ y if x ≤ y and y ≤ x and replacing each equivalence class with
respect to ∼ by a single element.

As far as I know, there is no standard term for relations that are irreflexive 
and antisymmetric but not necessarily transitive. 


9.5.1 Drawing partial orders 


Since partial orders are relations, we can draw them as directed graphs. But 
for many partial orders, this produces a graph with a lot of edges whose 
existence is implied by transitivity, and it can be easier to see what is going 
on if we leave the extra edges out. If we go further and line the elements 
up so that x is lower than y when x < y, we get a Hasse diagram: a 
representation of a partially ordered set as a graph where there is an edge 
from x to y if x < y and there is no z such that x < z < y.²

Figure 9.3 gives an example of the divisors of 12 partially ordered by 
divisibility, represented both as a digraph and as a Hasse diagram. Even in 
this small example, the Hasse diagram is much easier to read. 
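To make the difference between the two drawings concrete, the following Python sketch computes both edge sets for the divisors of 12. Including the self-loops in the full digraph is an assumption of this sketch, and the names less, digraph_edges, and hasse_edges are invented for the example.

```python
from itertools import product

divisors = [d for d in range(1, 13) if 12 % d == 0]


def less(x, y):
    """Strict order: x < y when x divides y and x != y."""
    return x != y and y % x == 0


# Full digraph of the partial order, self-loops included.
digraph_edges = {(x, y) for x, y in product(divisors, repeat=2) if y % x == 0}

# Hasse diagram: keep x -> y only if no z lies strictly between them.
hasse_edges = {
    (x, y)
    for x, y in product(divisors, repeat=2)
    if less(x, y) and not any(less(x, z) and less(z, y) for z in divisors)
}

print(len(digraph_edges), sorted(hasse_edges))
```

The Hasse edges are exactly the pairs in which x is an immediate predecessor of y, so the diagram needs far fewer edges than the digraph.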


9.5.2 Comparability


In a partial order, two elements x and y are comparable if x ≤ y or y ≤ x.
Elements that are not comparable are called incomparable. In a Hasse
diagram, comparable elements are connected by a path that only goes up.
For example, in Figure 9.3, 3 and 4 are not comparable because the only
paths between them require going both up and down. But 1 and 12 are
both comparable to everything. 


²There is special terminology for this situation: such an x is called a predecessor
or sometimes immediate predecessor of y; y in turn is a successor or sometimes
immediate successor of x.

Figure 9.3: Factors of 12 partially ordered by divisibility. On the left is the
full directed graph. On the right is a Hasse diagram, which uses relative
height instead of arrowheads to indicate direction and omits edges implied
by reflexivity and transitivity.


9.5.3 Lattices 


A lattice is a partial order in which (a) each pair of elements x and y has a
unique greatest lower bound or meet, written x ∧ y, with the property
that (x ∧ y) ≤ x, (x ∧ y) ≤ y, and z ≤ (x ∧ y) for any z with z ≤ x and z ≤ y;
and (b) each pair of elements x and y has a unique least upper bound or
join, written x ∨ y, with the property that (x ∨ y) ≥ x, (x ∨ y) ≥ y, and
z ≥ (x ∨ y) for any z with z ≥ x and z ≥ y. Meet and join are duals of
each other: the definition of join is obtained from the definition of meet by
replacing ≤ with ≥.

Examples of lattices are any total order (x ∧ y is min(x, y), x ∨ y is
max(x, y)), the subsets of a fixed set ordered by inclusion (x ∧ y is x ∩ y,
x ∨ y is x ∪ y), and the divisibility relation on the positive integers (x ∧ y
is the greatest common divisor, x ∨ y is the least common multiple; see
Chapter 8). Products of lattices with the product order are also lattices:
(x_1, x_2) ∧ (y_1, y_2) = (x_1 ∧_1 y_1, x_2 ∧_2 y_2) and (x_1, x_2) ∨ (y_1, y_2) = (x_1 ∨_1 y_1, x_2 ∨_2 y_2).³
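The divisibility example can be spot-checked numerically. The sketch below (Python; the helper names meet, join, and divides are made up for the example) takes meet to be gcd and join to be lcm, and tests the defining properties over a small range of candidates z. It is a sanity check, not a proof.

```python
from math import gcd


def meet(x: int, y: int) -> int:
    """Greatest lower bound under divisibility: the gcd."""
    return gcd(x, y)


def join(x: int, y: int) -> int:
    """Least upper bound under divisibility: the lcm."""
    return x * y // gcd(x, y)


def divides(a: int, b: int) -> bool:
    return b % a == 0


x, y = 12, 18
m, j = meet(x, y), join(x, y)

for z in range(1, 200):
    if divides(z, x) and divides(z, y):
        assert divides(z, m)   # every lower bound lies below the meet
    if divides(x, z) and divides(y, z):
        assert divides(j, z)   # every upper bound lies above the join

print(m, j)
```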


³The product of two lattices with lexicographic order is not always a lattice. For example,
consider the lex-ordered product of the subsets of {0, 1}, ordered by ⊆, with (ℕ, ≤). For the
elements x = ({0}, 0) and y = ({1}, 0), we have that z is a lower bound on x and y if and only
if z is of the form (∅, k) for some k ∈ ℕ. But there is no greatest lower bound for x and y
because given any particular k we can always choose a bigger k′.

Figure 9.4: Maximal and minimal elements. In the first poset, a is minimal
and a minimum, while b and c are both maximal but not maximums. In the
second poset, d is maximal and a maximum, while e and f are both minimal
but not minimums. In the third poset, g and h are both minimal, i and j
are both maximal, but there are no minimums or maximums.


9.5.4 Minimal and maximal elements


If for some x, y ≤ x only if y = x, then x is minimal. Equivalently, x is
minimal if there is no y such that y < x.

A partial order may have any number of minimal elements. The integers
have no minimal element, the naturals have one minimal element, and a set
with k elements none of which are comparable to each other has k minimal
elements.

If an element x satisfies x ≤ y for all y, then x is a minimum. A partial
order may have at most one minimum (for example, 0 in the naturals), but
may have no minimum, either because it contains an infinite descending
chain (like the negative integers) or because it has more than one minimal
element. Any minimum is also minimal.

The corresponding terms for elements that are not less than any other
element or that are greater than all other elements are maximal and
maximum, respectively.

Here is an example of the difference between a maximal and a maximum
element: consider the family of all subsets of ℕ with at most three elements,
ordered by ⊆. Then {0, 1, 2} is a maximal element of this family (it’s not a
subset of any larger set in the family), but it’s not a maximum because it’s
not a superset of {3}. (The same thing works for any other three-element set.)

See Figure 9.4 for some more examples.
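The distinction between minimal and minimum is easy to test on a finite poset. In the Python sketch below, the functions minimal_elements and minimum and the divisibility examples are illustrative choices, not part of the notes; leq can be any function implementing a partial order.

```python
def minimal_elements(elements, leq):
    """All x such that no other y satisfies y <= x."""
    return [x for x in elements
            if not any(leq(y, x) and y != x for y in elements)]


def minimum(elements, leq):
    """The unique x with x <= y for every y, or None if there is none."""
    for x in elements:
        if all(leq(x, y) for y in elements):
            return x
    return None


def divides(a, b):
    return b % a == 0


# With 1 present, 1 is both the only minimal element and the minimum.
with_one = [1, 2, 3, 4, 6, 12]
# Without 1, both 2 and 3 are minimal, so there is no minimum.
without_one = [2, 3, 4, 6, 12]

print(minimal_elements(with_one, divides), minimum(with_one, divides))
print(minimal_elements(without_one, divides), minimum(without_one, divides))
```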
9.5.5 Total orders


If any two elements of a partial order are comparable (that is, if at least one
of x ≤ y or y ≤ x holds for all x and y), then the partial order is a total
order. Total orders include many of the familiar orders on the naturals, the
reals, etc.

Any partial order (S, ≤) can be extended to a total order (generally
more than one, if the partial order is not total itself). This means that we
construct a new relation ≤′ on S that is a superset of ≤ and also totally
ordered. There is a straightforward way to do this when S is finite, called 
a topological sort, and a less straightforward way to do this when S is 
infinite. 


9.5.5.1 Topological sort 


A topological sort is an algorithm for sorting objects that are partially 
ordered, in a way that preserves the partial order. (An example is given in 
Figure 9.5.) It can be used to construct a schedule for executing a sequence of 
operations that depend on each other, and efficient algorithms for topological 
sort exist. We won’t bother with efficiency, and will just use the basic idea 
to show that a total extension of any finite partial order exists. 

The simplest version of this algorithm is to find a minimal element, put 
it first, and then sort the rest of the elements; this is similar to selection 
sort, an algorithm for doing ordinary sorting, where we find the smallest 
element of a set and put it first, find the next smallest element and put it 
second, and so on. In order for the selection-based version of topological sort 
to work, we have to know that there is, in fact, a minimal element.⁴


Lemma 9.5.1. Every nonempty finite partially-ordered set has a minimal 
element. 


Proof. Let (S, ≤) be a nonempty finite partially-ordered set. We will prove
that S contains a minimal element by induction on |S|.

If |S| = 1, then S = {x} for some x; x is the minimal element.

If |S| = n ≥ 2, let x be any element of S, and let T = S \ {x}. Then T
is a nonempty set of size n − 1 < n, and so by the induction hypothesis, T
has at least one minimal element y. For this y, there is no z ∈ T such that
z < y.

If y is a minimal element of S, we are done.

If not, there is some element q ∈ S such that q < y. But q can’t be in T,
so q ∈ S \ T, which makes q = x. So in this case x < y. If x is a minimal
element of S, then we are done.

If x is not a minimal element of S, then there is an r < x in S; since
r ≠ x, it is also in T. But then r < x < y implies r < y, contradicting
the assumption that y was minimal in T. Since this case can’t happen, we are
left with one of the two previous cases, and in both we establish that the
claim holds for n.

⁴There may be more than one, but one is enough.

Figure 9.5: Topological sort. On the right is a total order extending the
partial order on the left.


Now we can apply the selection-sort strategy previously described.

Theorem 9.5.2. Every partial order on a finite set has a total extension.


Proof. Let (S, ≤_S) be a finite partially-ordered set.

If S is empty, it contains no pair of incomparable elements, so it is already
totally ordered.

For nonempty sets S, we will prove the result by induction on |S|.

If S is nonempty, then by Lemma 9.5.1, it has a minimal element x. Let
T = S \ {x} and let ≤_T be the restriction of ≤_S to T. Then the induction
hypothesis says that (T, ≤_T) has a total extension ≤′_T. Define ≤′_S by y ≤′_S z
if y = x or y ≤′_T z.

First, let us show that ≤′_S extends ≤_S. Suppose a ≤_S b. There are three
cases:


1. a, b ∈ T. Then a ≤_S b → a ≤_T b → a ≤′_T b → a ≤′_S b.
2. a = x. Then x ≤′_S b always.
3. b = x. Then a ≤_S x → a = x → a ≤′_S x, where the first step uses the minimality of x.


Next, let us show that ≤′_S is a partial order. This requires verifying that
adding x to T doesn’t break reflexivity, antisymmetry, or transitivity. For
reflexivity, x ≤′_S x from the first case of the definition. For antisymmetry, if
y ≤′_S x then y = x, since y ≰′_T x for any y (x is not in T). For transitivity,
if x ≤′_S y ≤′_S z then x ≤′_S z (since x ≤′_S z for all z in S), and if
y ≤′_S x ≤′_S z then y = x ≤′_S z, and if y ≤′_S z ≤′_S x then y = z = x.

Finally, let’s make sure that we actually get a total order. This means
showing that any y and z in S are comparable. If y ≰′_S z, then y ≠ x, and
either z = x or z ∈ T and y ≰′_T z implies z ≤′_T y. In either case z ≤′_S y. The
case z ≰′_S y is symmetric.
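The construction in the proof translates directly into the selection-style procedure described earlier: repeatedly pull out a minimal element of what remains. The Python sketch below makes no attempt at efficiency (each selection rescans the remaining elements), and the name total_extension and the divisibility example are assumptions of this illustration.

```python
def total_extension(elements, leq):
    """List the elements so that leq(a, b) implies a appears no later than b."""
    remaining = list(elements)
    order = []
    while remaining:
        # Lemma 9.5.1 guarantees a minimal element, since `remaining`
        # is finite and nonempty; take the first one found.
        x = next(x for x in remaining
                 if not any(leq(y, x) and y != x for y in remaining))
        order.append(x)
        remaining.remove(x)
    return order


def divides(a, b):
    return b % a == 0


order = total_extension([12, 6, 4, 3, 2, 1], divides)
print(order)
```

The position of each element in the returned list defines the total order; a different rule for choosing among several minimal elements would generally give a different total extension.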


For infinite partial orders the situation is more complicated, because an
infinite partial order might not have any minimal elements (consider (ℤ, ≤)).
But we can still extend any partial order to a total order, even on an infinite
set.

The intuition is that we can always pick some pair of incomparable 
elements and declare one less than the other, fill in any other relations 
implied by transitivity, and repeat. If we ever reach a partial order where 
we can’t do this, that means we have no incomparable elements, so we have 
a total order. 

Unfortunately this process may take infinitely long, so we have to argue 
that it converges in the limit to a genuine total order using a tool called 
Zorn’s lemma, which itself is a theorem about partial orders.⁵


9.5.6 Well orders 


A well order is a particularly restricted kind of total order. A partial order 
is a well order if it is a total order and every nonempty subset S has a
minimum element x. An example of a well order is the usual order on ℕ.⁷


⁵You don’t need to know about Zorn’s Lemma for this course, but if you are curious,
Zorn’s Lemma says that if (S, ≤) is any poset, and every totally-ordered subset S′ of S
has an upper bound x, which is an element of S (but not necessarily S′) that is greater
than or equal to any y in S′, then S has a maximal element.

Zorn’s Lemma is one of the reasons we include the Axiom of Choice in our axiom system.
It is known that you can’t prove Zorn’s Lemma from just the other axioms without using
AC, and in the other direction it is possible to prove AC from ZF plus Zorn’s Lemma.
This explains the classic logician’s riddle, “What’s yellow and equivalent to the Axiom of
Choice?”⁶

Applying Zorn’s Lemma to partial orders, let R be some partial order on a set A, and
let S be the set of all partial orders R′ on A that are supersets of R, ordered by the subset
relation. Now given any chain of partial orders R_1 ⊆ R_2 ⊆ ... in S, their union is also
a partial order (this requires a proof) and any R_i is a subset of the union. So S has a
maximal partial order R̂.

If R̂ is not a total order, then there is some pair of elements x and y that are incomparable.
Let


T = R̂ ∪ {(x, y)} ∪ {(x, z) | (y, z) ∈ R̂} ∪ {(w, y) | (w, x) ∈ R̂} ∪ {(w, z) | (w, x) ∈ R̂ and (y, z) ∈ R̂}


Then T is reflexive (because all the pairs (x, x) are already in R and thus also in R̂) and
transitive (by a tedious case analysis), and antisymmetric (by another tedious case analysis),
meaning that it is a partial order that extends R (and thus an element of S) while also
being a proper superset of R̂. But this contradicts the assumption that R̂ is maximal. So
R̂ is in fact the total order we are looking for.

⁶Answer: Zorn’s lemon.

⁷Proof: We can prove that any nonempty S ⊆ ℕ has a minimum in a slightly roundabout
way by induction. The induction hypothesis (for x) is that if S contains some element y
less than or equal to x, then S has a minimum element. The base case is when x = 0; here
x is the minimum. Suppose now that the claim holds for x. Suppose also that S contains
some element y ≤ x + 1; if not, the induction hypothesis holds vacuously. If there is some
y ≤ x, then S has a minimum by the induction hypothesis. The alternative is that there is
no y in S such that y ≤ x, but there is a y in S with y ≤ x + 1. This y must be equal to
x + 1, and so y is the minimum.

An equivalent definition is that a total order is a well order if it contains
no infinite descending chain, which is an infinite sequence x_1 > x_2 >
x_3 > .... To show that this is implied by every set having a least element,
suppose that a given total order has the least-element property. Then given a
would-be infinite descending chain x_1, x_2, ..., let x_i be its least element. But
then x_i is not greater than x_{i+1}. For the converse, suppose that some set S
does not have a least element. Then we can construct an infinite descending
chain by choosing any x_1 ∈ S, then for each x_{i+1} choose some element less
than the smallest of x_1, ..., x_i. Zorn’s Lemma can be used to show that this
process converges to an infinite descending chain in the limit.

The useful property of well-orders is that we can do induction on them.
If it is the case that (a) P(m) holds, where m is the smallest element in some
set S, and (b) P(x′) for all x′ < x implies P(x), then P(x) holds for all x in
S. The proof is that if P(x) fails for some x, there is a least element y in S for
which it doesn’t hold (this is the least element in the set {y ∈ S | ¬P(y)},
which exists because S is well ordered). But this contradicts (a) if y = m
and (b) otherwise.

For sets that aren’t well-ordered, this argument generally doesn’t work.
For example, we can’t do induction on the integers because there is no
number negative enough to be the base case, and even if we add a new
minimum element −∞, when we do the induction step we can’t find the
minimum y in the set of integers excluding −∞.

It is possible in an infinite set to have a well-ordering in which some
elements do not have predecessors. For example, consider the order on
S = ℕ ∪ {ω} defined by x ≤ y if either (a) x and y are both in ℕ and x ≤ y
by the usual ordering on ℕ or (b) y = ω. This is a total order that is also
a well order, but ω has no immediate predecessor. In this case we can still
do induction proofs, but since ω is not n + 1 for any n, we need a special
case in the proof to handle it. For a more complicated example, the set
ω + ω = {0, 1, 2, ...; ω, ω + 1, ω + 2, ...} is also well-ordered, so we can do
induction on it if we can show P(0), P(x) → P(x + 1) (including for cases
like x = ω + 5 and x + 1 = ω + 6), and P(ω) (possibly using P(x) for all
x < ω in the last case).


9.6 Closures


In general, the closure of some mathematical object with respect to a given
property is the smallest larger object that has the property. Usually “smaller”
and “larger” are taken to mean subset or superset, so we are really looking
at the intersection of all larger objects with the property, or equivalently we
are looking for an object that has the property and that is a subset of all
larger objects with the property. Such a closure always exists if the property
is preserved by intersection (formally, if (∀i : P(S_i)) → P(⋂_i S_i)) and every
object has at least one larger object with the property.

This rather abstract definition can be made more explicit for certain 
kinds of closures of relations. The reflexive closure of a relation R (whose
domain and codomain are equal) is the smallest super-relation of R that is
reflexive; it is obtained by adding (x, x) to R for all x in R’s domain, which
we can write as R⁰ ∪ R where R⁰ is just the identity relation on the domain of
R. The symmetric closure is the smallest symmetric super-relation of R;
it is obtained by adding (y, x) to R whenever (x, y) is in R, or equivalently
by taking R ∪ R⁻¹. The transitive closure is obtained by adding (x, z) to
R whenever (x, y) and (y, z) are both in R for some y, and continuing to
do so until no new pairs of this form remain.⁸ The transitive closure can
also be computed as R⁺ = R¹ ∪ R² ∪ R³ ∪ ...; for reflexive R, this is equal
to R* = R⁰ ∪ R¹ ∪ R² ∪ .... Even if R is not already reflexive, R* gives the
reflexive transitive closure of R.⁹

In digraph terms, the reflexive closure adds self-loops to all nodes, the 
symmetric closure adds a reverse edge for each edge, and the transitive 
closure adds an edge for each directed path through the graph (see Figure 9.6).
One can also take the closure with respect to multiple properties, such 
as the reflexive symmetric transitive closure of R, which will be the 
smallest equivalence relation in which any elements that are related by R 
are equivalent. 
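For a finite relation stored as a set of pairs, the three closures can be computed along the lines described above. The following Python sketch uses function names chosen for this example; the loop in transitive_closure keeps adding implied pairs until nothing new appears, which is guaranteed to stop only because the relation is finite.

```python
def reflexive_closure(r, domain):
    return r | {(x, x) for x in domain}


def symmetric_closure(r):
    return r | {(y, x) for (x, y) in r}


def transitive_closure(r):
    closure = set(r)
    while True:
        # Add (x, z) whenever (x, y) and (y, z) are both present.
        new_pairs = {(x, z)
                     for (x, y) in closure
                     for (w, z) in closure if y == w}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


r = {(0, 1), (1, 2)}
domain = {0, 1, 2}
print(sorted(reflexive_closure(r, domain)))
print(sorted(symmetric_closure(r)))
print(sorted(transitive_closure(r)))
```

Composing all three, with the transitive closure taken last, gives the smallest equivalence relation containing r.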

⁸This may not actually terminate if R’s domain is not finite.
⁹All of this can be proved by doing lots of induction.

Figure 9.6: Reflexive, symmetric, and transitive closures of a relation
represented as a directed graph. The original relation {(0,1), (1,2)} is on top;
the reflexive, symmetric, and transitive closures are depicted below it.

Figure 9.7: Reducing a graph to its strongly-connected components. On the
left is the original graph. On the right, each strongly-connected component
has been contracted to a single vertex. The contracted graph is acyclic.

Closures provide a way of turning things that aren’t already equivalence
relations or partial orders into equivalence relations and partial orders. For
equivalence relations this is easy: take the reflexive symmetric transitive
closure, and you get a reflexive symmetric transitive relation. For partial
orders it’s trickier: antisymmetry isn’t a closure property (even though
it’s preserved by intersection, a non-antisymmetric R can’t be made
antisymmetric by adding more pairs). Given a relation R on some set S, the
best we can do is take the reflexive transitive closure R* and hope that it’s
antisymmetric. If it is, we are done. If it isn’t, we can observe that the
relation ~ defined by x ~ y if xR*y and yR*x is an equivalence relation
(Proof: x ~ x because R* is reflexive, x ~ y → y ~ x from the symmetry of
the definition, and x ~ y ∧ y ~ z → x ~ z because transitivity of R* gives
xR*y ∧ yR*z → xR*z and yR*x ∧ zR*y → zR*x). So we can take the quotient
S/~, which smashes all the equivalence classes of ~ into single points, define
a quotient relation R*/~ in the obvious way, and this quotient relation will
be a partial order. This is the relational equivalent of the standard graph-
theoretic algorithm that computes strongly-connected components (the
equivalence classes of ~) and constructs a directed acyclic graph from
the original by contracting each strongly-connected component to a single
vertex. See Figure 9.7 for an example.
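The quotient construction can be sketched in Python for a small finite relation. This is a naive illustration rather than the efficient strongly-connected-components algorithm the text alludes to; the function names and the example relation are invented here.

```python
def reflexive_transitive_closure(r, domain):
    closure = set(r) | {(x, x) for x in domain}
    while True:
        new_pairs = {(x, z)
                     for (x, y) in closure
                     for (w, z) in closure if y == w}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


def quotient_order(r, domain):
    """Return the classes of ~ and the relation R*/~ between classes."""
    star = reflexive_transitive_closure(r, domain)
    # Class of x: every y with x R* y and y R* x.
    cls = {x: frozenset(y for y in domain
                        if (x, y) in star and (y, x) in star)
           for x in domain}
    classes = set(cls.values())
    order = {(cls[x], cls[y]) for (x, y) in star}
    return classes, order


domain = {0, 1, 2, 3}
r = {(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)}
classes, order = quotient_order(r, domain)
print(classes)
print(order)
```

Here 0 and 1 collapse into one class and 2 and 3 into another, and the quotient relation puts the first class below the second.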


9.6.1 Examples 


• Let R be the relation on finite subsets of ℕ given by xRy if there exists some
n ∉ x such that y = x ∪ {n}. The transitive closure of R is the proper
subset relation ⊂, where x ⊂ y if x ⊆ y but x ≠ y. The reflexive
transitive closure R* of R is just the ordinary subset relation ⊆. The
reflexive symmetric transitive closure of R is the complete relation;
given any two sets x and y, we can get from x to ∅ via (R*)⁻¹ and
then to y via R*. So in this case the reflexive symmetric transitive
closure is not very interesting.


• Let R be the relation on ℕ given by xRy if x = 2y. Then the reflexive
transitive closure R* is the relation given by xR*y if x = 2^n y for some
n ∈ ℕ, and the reflexive symmetric transitive closure is the relation
given by x ~ y if x = 2^n y or y = 2^n x for some n ∈ ℕ. For this R,
not all elements of the underlying set are equivalent in the reflexive
symmetric transitive closure; for example, 3 ≁ 5, because there is no
way to transform a 3 into a 5 no matter how many times we multiply
or divide by 2.


Chapter 10 


Graphs 


A graph is a structure in which pairs of vertices are connected by edges. 
Each edge may act like an ordered pair (in a directed graph) or an un- 
ordered pair (in an undirected graph). We’ve already seen directed graphs 
as a representation for relations. Most work in graph theory concentrates 
instead on undirected graphs. 

Because graph theory has been studied for many centuries in many 
languages, it has accumulated a bewildering variety of terminology, with 
multiple terms for the same concept (e.g. node for vertex or arc for edge) and 
ambiguous definitions of certain terms (e.g., a “graph” without qualification 
might be either a directed or undirected graph, depending on who is using the 
term: graph theorists tend to mean undirected graphs, but you can’t always 
tell without looking at the context). We will try to stick with consistent 
terminology to the extent that we can. In particular, unless otherwise 
specified, a graph will refer to a finite simple undirected graph: an 
undirected graph with a finite number of vertices, where each edge connects 
two distinct vertices (thus no self-loops) and there is at most one edge 
between each pair of vertices (no parallel edges). 

A reasonably complete glossary of graph theory can be found at
http://en.wikipedia.org/wiki/Glossary_of_graph_theory. See also
Ferland [Fer08], Chapters 8 and 9; Rosen [Ros12] Chapter 10; or Biggs [Big02] 
Chapter 15 (for undirected graphs) and 18 (for directed graphs). 

If you want to get a fuller sense of the scope of graph theory, Reinhard 
Diestel’s (graduate) textbook Graph Theory [Die10] can be downloaded from
http://diestel-graph-theory.com.




Figure 10.1: A directed graph 


10.1 Types of graphs 


Graphs are represented as ordered pairs G = (V, E), where V is a set of
vertices and E a set of edges. The differences between different types of
graphs depend on what can go in E. When not otherwise specified, we
usually think of a graph as an undirected graph (see below), but there are 
other variants. Typically we assume that V and E are both finite. 


10.1.1 Directed graphs 


In a directed graph or digraph, each element of E is an ordered pair, and
we think of edges as arrows from a source, tail, or initial vertex to
a sink, head, or terminal vertex; each of these two vertices is called an
endpoint of the edge. A directed graph is simple if there is at most one 
edge from one vertex to another. A directed graph that has multiple edges 
from some vertex u to some other vertex v is called a directed multigraph. 

For simple directed graphs, we can save a lot of ink by adopting the 
convention of writing an edge (u,v) from u to v as just uv. 

Directed graphs are drawn as in Figure 10.1. 

As we saw in the notes on relations, there is a one-to-one correspondence 
between simple directed graphs with vertex set V and relations on V. 


10.1.2 Undirected graphs 


In an undirected graph, each edge is an undirected pair, which we can
represent as a subset of V with one or two elements. A simple undirected
graph contains no duplicate edges and no loops (an edge from some vertex
u back to itself); this means we can represent all edges as two-element subsets
of V. Most of the time, when we say graph, we mean a simple undirected
graph. Though it is possible to consider infinite graphs, for convenience we
will limit ourselves to finite graphs, where n = |V| and m = |E| are both
natural numbers.

Figure 10.2: A graph


As with directed graphs, instead of writing an edge as {u,v}, we will 
write an edge between u and v as just uv. Note that in an undirected graph, 
uv and vu are the same edge. 

Graphs are drawn just like directed graphs, except that the edges don’t 
have arrowheads on them. See Figure 10.2 for an example. 

If we have loops or parallel edges, we have a more complicated structure 
called a multigraph. This requires a different representation where elements 
of E are abstract edges and we have a function mapping each element of
E to its endpoints. Some authors make a distinction between pseudographs
(with loops) and multigraphs (without loops), but we’ll use multigraph for 
both. 

Simple undirected graphs also correspond to relations, with the restriction 
that the relation must be irreflexive (no loops) and symmetric (undirected 
edges). This also gives a representation of undirected graphs as directed 
graphs, where the edges of the directed graph always appear in pairs going 
in opposite directions. 


10.1.3 Hypergraphs


In a hypergraph, the edges (called hyperedges) are arbitrary nonempty 
sets of vertices. A k-hypergraph is one in which all such hyperedges
connect exactly k vertices; an ordinary graph is thus a 2-hypergraph.

Hypergraphs can be drawn by representing each hyperedge as a closed 
curve containing its members, as in the left-hand side of Figure 10.3. 

Hypergraphs aren’t used very much, because it is always possible (though
not always convenient) to represent a hypergraph by a bipartite graph. In
a bipartite graph, the vertex set can be partitioned into two subsets S and T,
such that every edge connects a vertex in S with a vertex in T. To represent
a hypergraph H as a bipartite graph, we simply represent the vertices of H
as vertices in S and the hyperedges of H as vertices in T, and put in an edge
(s, t) whenever s is a member of the hyperedge t in H. The right-hand side
of Figure 10.3 gives an example.


Figure 10.3: Two representations of a hypergraph. On the left, four vertices 
are connected by three hyperedges. On the right, the same four vertices are 
connected by ordinary edges to new vertices representing the hyperedges. 
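The bipartite representation of a hypergraph is mechanical enough to sketch in a few lines of Python. The hypergraph below is made-up example data, not the one in Figure 10.3, and the sketch assumes that hyperedge names are distinct from vertex names.

```python
def hypergraph_to_bipartite(vertices, hyperedges):
    """hyperedges maps each hyperedge name to its set of member vertices.

    Returns (S, T, E): S holds the original vertices, T holds one new
    vertex per hyperedge, and E holds a pair (s, t) whenever s is in t.
    """
    s_side = set(vertices)
    t_side = set(hyperedges)
    edges = {(s, t) for t, members in hyperedges.items() for s in members}
    return s_side, t_side, edges


vertices = {1, 2, 3, 4}
hyperedges = {"e1": {1, 2, 3}, "e2": {2, 4}, "e3": {3, 4}}
s_side, t_side, edges = hypergraph_to_bipartite(vertices, hyperedges)
print(sorted(edges))
```

Every edge joins the S side to the T side, so the result is bipartite by construction, and the number of edges equals the total size of the hyperedges.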


10.2 Examples of graphs 


Any relation produces a graph, which is directed for an arbitrary relation 
and undirected for a symmetric relation. Examples are graphs of parenthood 
(directed), siblinghood (undirected), handshakes (undirected), etc. 

Graphs often arise in transportation and communication networks. Here’s 
a (now very out-of-date) route map for Jet Blue airlines, originally taken 
from http://www.jetblue.com/travelinfo/routemap.html:


[Figure: JetBlue route map, drawn as a graph whose vertices are cities
(among them Seattle, Syracuse, Burlington, Rochester, Sacramento, Salt Lake
City, Oakland, San Jose, Ontario, Long Beach, San Diego, Fort Lauderdale,
Santiago, Santo Domingo, Aguadilla, and San Juan) and whose edges are routes.]

Such graphs are often labeled with edge lengths, prices, etc. In computer
networking, the design of network graphs that permit efficient routing of data
without congestion, roundabout paths, or excessively large routing tables is
a central problem.

The web graph is a directed multigraph with web pages for vertices 
and hyperlinks for edges. Though it changes constantly, its properties have 
been fanatically studied both by academic graph theorists and employees 
of search engine companies, many of which are still in business. Companies 
like Google base their search rankings largely on structural properties of the 
web graph. 

Peer-to-peer systems for data sharing often have a graph structure, 
where each peer is a node and connections between peers are edges. The 
problem of designing efficient peer-to-peer systems is similar in many ways 
to the problem of designing efficient networks; in both cases, the structure 
(or lack thereof) of the underlying graph strongly affects efficiency. 


10.3 Local structure of graphs


There are some useful standard terms for describing the immediate connec- 
tions of vertices and edges: 


• Incidence: a vertex is incident to any edge of which it is an endpoint
(and vice versa).


• Adjacency, neighborhood: two vertices are adjacent if they are the
endpoints of some edge. The neighborhood of a vertex v is the set
of all vertices that are adjacent to v.


• Degree, in-degree, out-degree: the degree of v counts the number of edges
incident to v. In a directed graph, in-degree counts only incoming
edges and out-degree counts only outgoing edges (so that the degree
is always the in-degree plus the out-degree). The degree of a vertex
v is often abbreviated as d(v); in-degree and out-degree are similarly
abbreviated as d⁻(v) and d⁺(v), respectively.
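These local quantities are straightforward to compute from an edge list. In the Python sketch below, the function names and the four-vertex example are chosen for illustration; the same list of pairs is read once as undirected edges and once as directed edges.

```python
from collections import defaultdict


def neighborhoods(edges):
    """Map each vertex of an undirected graph to its set of neighbors."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def in_out_degrees(arcs):
    """In-degree and out-degree of each vertex of a directed graph."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

nbrs = neighborhoods(edges)
# In a simple graph, the degree is the size of the neighborhood.
degree = {v: len(nbrs[v]) for v in nbrs}

indeg, outdeg = in_out_degrees(edges)
print(degree, dict(indeg), dict(outdeg))
```

In this example the degrees sum to twice the number of edges, because each edge is counted once at each endpoint.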


10.4 Some standard graphs 


Most graphs have no particular structure, but there are some families of 
graphs for which it is convenient to have standard names. Some examples 
are: 


• Complete graph K_n. This has n vertices, and every pair of vertices
has an edge between them. See Figure 10.4.

Figure 10.4: Complete graphs K_1 through K_10

Figure 10.5: Cycle graphs, beginning with C_3


• Cycle graph C_n. This has vertices {0, 1, ..., n − 1} and an edge from i
to i + 1 for each i < n − 1, plus an edge from n − 1 to 0. For any cycle, n must
be at least 3. See Figure 10.5.


• Path P_n. This has vertices {0, 1, 2, ..., n} and an edge from i to i + 1
for each i < n. Note that, despite the usual convention, n counts the number
of edges rather than the number of vertices; we call the number of
edges the length of the path. See Figure 10.6.


• Complete bipartite graph K_{m,n}. This has a set A of m vertices
and a set B of n vertices, with an edge between every vertex in A and
every vertex in B, but no edges within A or B. See Figure 10.7.

Figure 10.6: Path graphs, beginning with P_0

Figure 10.7: Complete bipartite graph K_{3,4}


• Star graphs. These have a single central vertex that is connected to
n outer vertices, and are the same as K_{1,n}. See Figure 10.8.


• The cube Q_n. This is defined by letting the vertex set consist of all
n-bit strings, and putting an edge between u and u′ if u and u′ differ
in exactly one place. It can also be defined by taking the n-fold square
product of an edge with itself (see §10.6).


• Cayley graphs. The Cayley graph of a group G with a given set of
generators S is a labeled directed graph. The vertices of this graph are
the group elements, and for each element g in G and generator s in S
there is a directed edge from g to gs labeled with s. An example of a
small Cayley graph, based on the dihedral group D_4 of symmetries
of the square, is given in Figure 10.9.


Many common graphs are Cayley graphs with the labels (and possibly
edge orientations) removed; for example, a directed cycle on m elements
is the Cayley graph of ℤ_m with generator 1, an n × m torus is the
Cayley graph of ℤ_n × ℤ_m with generators (1, 0) and (0, 1), and the
cube Q_n is the Cayley graph of (ℤ_2)^n with generators all vectors that
are zero in all positions but one.
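Several of the families above can be generated directly from their definitions. The Python sketch below represents each undirected edge as a two-element frozenset. The function names and the ("a", i) / ("b", j) labels for the two sides of the bipartite graph are choices made for this example.

```python
from itertools import combinations, product


def complete_graph(n):
    return {frozenset(p) for p in combinations(range(n), 2)}


def cycle_graph(n):
    assert n >= 3, "a cycle needs at least 3 vertices"
    return {frozenset((i, (i + 1) % n)) for i in range(n)}


def path_graph(n):
    """P_n has n edges and n + 1 vertices."""
    return {frozenset((i, i + 1)) for i in range(n)}


def complete_bipartite(m, n):
    a = [("a", i) for i in range(m)]
    b = [("b", j) for j in range(n)]
    return {frozenset((u, v)) for u in a for v in b}


def cube_graph(n):
    vertices = ["".join(bits) for bits in product("01", repeat=n)]
    # Join two bit strings when they differ in exactly one position.
    return {frozenset((u, v)) for u, v in combinations(vertices, 2)
            if sum(x != y for x, y in zip(u, v)) == 1}


print(len(complete_graph(5)), len(cycle_graph(5)), len(path_graph(4)),
      len(complete_bipartite(3, 4)), len(cube_graph(3)))
```

A star with n outer vertices is then just complete_bipartite(1, n).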


Graphs may not always be drawn in a way that makes their structure
obvious. For example, Figure 10.10 shows two different presentations of Q_3,
neither of which looks much like the other.

Figure 10.8: Star graphs, beginning with K_{1,3}

Figure 10.9: Cayley graph of the dihedral group D_4, with generators a
corresponding to a clockwise rotation (red arrows) and b corresponding to a
flip around the vertical axis (blue arrows). Note that this is a directed graph.

Figure 10.10: Two presentations of the cube graph Q_3


10.5 Subgraphs and minors 


A graph G is a subgraph of a graph H, written G ⊆ H, if V_G ⊆ V_H and
E_G ⊆ E_H. We will also sometimes say that G is a subgraph of H if it is
isomorphic to a subgraph of H, which is equivalent to having an injective
homomorphism from G to H.

One can get a subgraph by deleting edges or vertices or both. Note that
deleting a vertex also requires deleting any edges incident to the vertex (since
we can’t have an edge with a missing endpoint). If we delete only vertices,
so that the only edges lost are those incident to a deleted vertex, we get an
induced subgraph. Formally, the subgraph of a graph $H$ whose vertex set
is $S$ and that contains every edge in $H$ with both endpoints in $S$ is called
the subgraph of $H$ induced by $S$.

A minor of a graph $H$ is a graph obtained from $H$ by deleting edges
and/or vertices (as in a subgraph) and contracting edges, where two adjacent
vertices $u$ and $v$ are merged together into a single vertex that is
adjacent to all of the previous neighbors of both vertices. Minors are useful
for recognizing certain classes of graphs. For example, a graph can be drawn
in the plane without any crossing edges if and only if it doesn’t contain $K_5$
or $K_{3,3}$ as a minor (this is known as Wagner’s theorem).

Figure 10.11 shows some subgraphs and minors of the graph from
Figure 10.2.

Figure 10.11: Examples of subgraphs and minors. Top left is the original
graph. Top right is a subgraph that is not an induced subgraph. Bottom
left is an induced subgraph. Bottom right is a minor.


10.6 Graph products 


There are at least five different definitions of the product of two graphs used 
by serious graph theorists. In each case the vertex set of the product is the 
Cartesian product of the vertex sets, but the different definitions throw in 
different sets of edges. Two of them are used most often: 


• The square product or graph Cartesian product $G \square H$. An edge
$(u,u')(v,v')$ is in $G \square H$ if and only if (a) $u = v$ and $u'v'$ is an edge in
$H$, or (b) $uv$ is an edge in $G$ and $u' = v'$. It’s called the square product
because the product of two (undirected) edges looks like a square. The
intuition is that each vertex in $G$ is replaced by a copy of $H$, and then
corresponding vertices in the different copies of $H$ are linked whenever
the original vertices in $G$ are adjacent. For algebraists, square products
are popular because they behave correctly for Cayley graphs: if $C_1$
and $C_2$ are the Cayley graphs of $G_1$ and $G_2$ (for particular choices of
generators), then $C_1 \square C_2$ is the Cayley graph of $G_1 \times G_2$.


  – The cube $Q_n$ can be defined recursively by $Q_1 = P_1$ and
    $Q_n = Q_{n-1} \square Q_1$. It is also the case that $Q_n = Q_k \square Q_{n-k}$.
  – An n-by-m mesh is given by $P_{n-1} \square P_{m-1}$.

• The cross product or categorical graph product $G \times H$. Now
$(u,u')(v,v')$ is in $G \times H$ if and only if $uv$ is in $G$ and $u'v'$ is in $H$. In the
cross product, the product of two (again undirected) edges is a cross:
an edge from $(u,u')$ to $(v,v')$ and one from $(u,v')$ to $(v,u')$. The cross
product is not as useful as the square product for defining nice-looking
graphs, but it can arise in some other situations. An example is when
$G$ and $H$ describe the positions (vertices) and moves (directed edges)
of two solitaire games; then the cross product $G \times H$ describes the
combined game in which at each step the player must make a move in
both games. (In contrast, the square product $G \square H$ describes a game
where the player can choose at each step to make a move in either
game.)
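
As a rough illustration of the two edge rules, the following Python sketch builds both products for small undirected graphs. The representation (a vertex list plus a set of `frozenset` edges) and the names `square_product` and `cross_product` are assumptions made for this example only.

```python
from itertools import product

def square_product(g, h):
    """Square (Cartesian) product of undirected graphs g = (V, E) and h = (V', E')."""
    (gv, ge), (hv, he) = g, h
    vertices = list(product(gv, hv))
    edges = set()
    for (u, u2), (v, v2) in product(vertices, repeat=2):
        # (a) u = v and u'v' is an edge in H, or (b) uv is an edge in G and u' = v'
        if (u == v and frozenset((u2, v2)) in he) or \
           (frozenset((u, v)) in ge and u2 == v2):
            edges.add(frozenset(((u, u2), (v, v2))))
    return vertices, edges

def cross_product(g, h):
    """Cross (categorical) product: move along an edge in both factors at once."""
    (gv, ge), (hv, he) = g, h
    vertices = list(product(gv, hv))
    edges = set()
    for (u, u2), (v, v2) in product(vertices, repeat=2):
        if frozenset((u, v)) in ge and frozenset((u2, v2)) in he:
            edges.add(frozenset(((u, u2), (v, v2))))
    return vertices, edges

# A single undirected edge, used as the building block for the cube.
edge = ([0, 1], {frozenset((0, 1))})
q3 = square_product(square_product(edge, edge), edge)
cross = cross_product(edge, edge)
```

Here `q3` is a copy of the cube on 8 vertices with 12 edges, while `cross` contains just the two diagonal edges that give the cross product its name.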


10.7 Functions between graphs 


A function from a graph $G$ to another graph $H$ typically maps $V_G$ to $V_H$,
with the edges coming along for the ride. For simplicity, we will generally
write $f : G \to H$ when we really mean $f : V_G \to V_H$.

A function $f : G \to H$ is a graph homomorphism if, for every edge
uv in G, f(u)f(v) is an edge in H. Note that this only goes one way: it is 
possible to have an edge f(u)f(v) in H but no edge uv in G. Generally we 
will only be interested in functions between graphs that are homomorphisms, 
and even among homomorphisms, some functions are more interesting than 
others. 

A graph homomorphism that has an inverse that is also a graph
homomorphism is called a graph isomorphism. Two graphs $G$ and $H$ are
isomorphic if there is an isomorphism between them. Intuitively, this
means that $G$ and $H$ are basically the same graph, with different names
for the vertices, and we will often treat them as the same graph. So, for
example, we will think of a graph $G = (V,E)$ where $V = \{1,3,5\}$ and
$E = \{\{1,3\}, \{3,5\}, \{1,5\}\}$ as an instance of $C_3$ and $K_3$ even if the vertex
labels are not what we might have chosen by default. To avoid confusion
with set equality, we write $G \cong H$ when $G$ and $H$ are isomorphic.

Every graph is isomorphic to itself, because the identity function is an
isomorphism. But some graphs have additional isomorphisms. An isomorphism
from $G$ to $G$ is called an automorphism of $G$ and corresponds to
an internal symmetry of $G$. For example, the cycle $C_n$ has $2n$ different
automorphisms (to count them, observe there are $n$ places we can send
vertex 0 to, and having picked a place to send vertex 0 to, there are only 2
places to send vertex 1; so we have essentially $n$ rotations times 2 for flipping
or not flipping the graph). A path $P_n$ (when $n \ge 1$) has 2 automorphisms
(reverse the direction or not). Many graphs have no automorphisms except
the identity map.
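
For very small graphs, the automorphism counts above can be checked by brute force. The sketch below is only an illustration: it tries every permutation of the vertices, so it is hopeless beyond a handful of vertices, and the helper names (`count_automorphisms`, `cycle`, `path`) are made up for this example.

```python
from itertools import permutations

def count_automorphisms(n, edges):
    """Count bijections on {0, ..., n-1} that map the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    count = 0
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        # On a finite graph, a vertex bijection that sends edges to edges
        # hits every edge, so its inverse is a homomorphism as well.
        if image == edge_set:
            count += 1
    return count

def cycle(n):
    """Edges of C_n on vertices 0, ..., n-1."""
    return [(i, (i + 1) % n) for i in range(n)]

def path(n):
    """Edges of P_n, which has n edges and n + 1 vertices in the text's convention."""
    return [(i, i + 1) for i in range(n)]

c5_count = count_automorphisms(5, cycle(5))
p3_count = count_automorphisms(4, path(3))
```

With these definitions `c5_count` comes out as 2n = 10 and `p3_count` as 2, matching the counting argument.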

An injective homomorphism from $G$ to $H$ is an isomorphism between $G$
and some subgraph $H'$ of $H$. In this case, we often say that $G$ is a subgraph
of $H$, even though technically it is just a copy of $G$ that appears as a subgraph
of $H$. This allows us to say, for example, that $P_n$ is a subgraph of $P_{n+1}$,¹ or
that all graphs on at most $n$ vertices are subgraphs of $K_n$.

Homomorphisms that are not injective are not as useful, but they
can sometimes be used to characterize particular classes of graphs indirectly.
For example, there is a homomorphism from a graph $G$ to $P_1$ if and only if
$G$ is bipartite (see §C.7.2 for a proof). In general, there is a homomorphism
from $G$ to $K_n$ if and only if $G$ is n-partite (recall $P_1 = K_2$).
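
The link between homomorphisms into $K_n$ and n-partite graphs can be tested directly on tiny examples. The following sketch searches all vertex maps by brute force; the function name and the edge-list representation are assumptions for illustration, not part of the text.

```python
from itertools import product

def has_homomorphism_to_complete(vertices, edges, n):
    """Brute-force search for a homomorphism from G to K_n (vertices 0..n-1)."""
    vertices = list(vertices)
    for images in product(range(n), repeat=len(vertices)):
        f = dict(zip(vertices, images))
        # K_n has an edge between every pair of distinct vertices, so f is
        # a homomorphism exactly when no edge of G has both ends sent to
        # the same vertex.
        if all(f[u] != f[v] for u, v in edges):
            return True
    return False

c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
```

The even cycle `c4` maps to $K_2$ and so is bipartite; the odd cycle `c5` does not, but it does map to $K_3$.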


10.8 Paths and connectivity 


A fundamental property of graphs is connectivity: whether the graph can 
be divided into two or more pieces with no edges between them. Often it 
makes sense to talk about this in terms of reachability, or whether you can 
get from one vertex to another along some path. 

The pedantic definition of a path of length $n$ in a graph is the
image of a homomorphism from $P_n$. In ordinary speech, it’s a sequence
of $n+1$ vertices $v_0, v_1, \ldots, v_n$ such that $v_i v_{i+1}$ is an edge in the graph for
each $i$. A path is simple if the same vertex never appears twice (i.e., if the
homomorphism is injective). If there is a path from $u$ to $v$, there is a simple
path from $u$ to $v$ obtained by removing cycles (Lemma 10.10.1).

If there is a path from $u$ to $v$, then $v$ is reachable from $u$: $u \xrightarrow{*} v$. We
also say that $u$ is connected to $v$. It’s easy to see that connectivity is
reflexive (take a path of length 0) and transitive (paste a path from $u$ to $v$
together with a path from $v$ to $w$ to get a path from $u$ to $w$). But it’s not
necessarily symmetric if we have a directed graph.

In an undirected graph, connectivity is symmetric, so it’s an equivalence
relation. The equivalence classes of $\xrightarrow{*}$ are called the connected
components of $G$, and $G$ itself is connected if and only if it has a single connected
component, i.e., if every vertex is reachable from every other vertex. (Note
that isolated vertices count as (separate) connected components.)


¹In four different ways!

In a directed graph, we can make connectivity symmetric in one of two
different ways:


• Define $u$ to be strongly connected to $v$ if $u \xrightarrow{*} v$ and $v \xrightarrow{*} u$. I.e., $u$
and $v$ are strongly connected if you can go from $u$ to $v$ and back again
(not necessarily through the same vertices).


It’s easy to see that strong connectivity is an equivalence relation. 
The equivalence classes are called strongly-connected components. 
A graph G is strongly connected if it has one strongly-connected 
component, i.e., if every vertex is reachable from every other vertex. 


• Define $u$ to be weakly connected to $v$ if $u \xrightarrow{*} v$ in the undirected
graph obtained by ignoring edge orientation. The intuition is that
$u$ is weakly connected to $v$ if there is a path from $u$ to $v$ if you are
allowed to cross edges backwards. Weakly-connected components are
defined by equivalence classes; a graph is weakly-connected if it has
one component. Weak connectivity is a “weaker” property than strong
connectivity in the sense that if $u$ is strongly connected to $v$, then $u$ is
weakly connected to $v$; but the converse does not necessarily hold.
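
Components are easy to compute by repeatedly exploring from an unvisited vertex. The sketch below finds weakly-connected components of a directed graph by ignoring orientation, which is the same computation as finding the connected components of an undirected graph. The function name and edge-list input are assumptions for this example.

```python
from collections import deque

def weak_components(vertices, edges):
    """Weakly-connected components of a directed graph, by breadth-first search."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)          # ignore edge orientation
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for w in neighbors[u] - seen:
                seen.add(w)
                component.add(w)
                queue.append(w)
        components.append(component)
    return components

# Edges 0 -> 1 and 2 -> 1 point in opposite directions, yet 0, 1, 2 are
# weakly connected; vertex 5 is isolated and forms its own component.
parts = weak_components(range(6), [(0, 1), (2, 1), (3, 4)])
```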


The k-th power $G^k$ of a graph $G$ has the same vertices as $G$, but $uv$ is
an edge in $G^k$ if and only if there is a path of length $k$ from $u$ to $v$ in $G$.
The transitive closure of a directed graph $G$ is $G^* = \bigcup_{k=0}^{\infty} G^k$. I.e., there is
an edge $uv$ in $G^*$ if and only if there is a path (of any length, including zero)
from $u$ to $v$ in $G$, or in other words if $u \xrightarrow{*} v$. This is equivalent to taking
the reflexive transitive closure of the adjacency relation.
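
For a finite graph the infinite union can be cut off early, because any path on $n$ vertices can be shortened to one of length less than $n$. The sketch below computes the closure as a union of successive powers under that observation; the names `transitive_closure` and `strongly_connected` are hypothetical helpers for this example.

```python
def transitive_closure(vertices, edges):
    """Edge set of G* (the union of the powers G^k) as a set of (u, v) pairs."""
    vertices = list(vertices)
    successors = {v: set() for v in vertices}
    for u, v in edges:
        successors[u].add(v)
    power = {(v, v) for v in vertices}       # G^0: paths of length zero
    closure = set(power)
    for _ in range(len(vertices)):
        # Extend every path counted in G^k by one edge to obtain G^(k+1).
        power = {(u, w) for (u, v) in power for w in successors[v]}
        closure |= power
    return closure

def strongly_connected(closure, u, v):
    """u and v are strongly connected if each is reachable from the other."""
    return (u, v) in closure and (v, u) in closure

path_closure = transitive_closure(range(4), [(0, 1), (1, 2)])
cycle_closure = transitive_closure(range(3), [(0, 1), (1, 2), (2, 0)])
```

On the directed path, 2 is reachable from 0 but not the other way around; on the directed 3-cycle, every pair of vertices is strongly connected.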


10.9 Cycles 


The standard cycle graph $C_n$ has vertices $\{0,1,\ldots,n-1\}$ with an edge from
$i$ to $i+1$ for each $i$ and from $n-1$ to 0. To avoid degeneracies, $n$ must be
at least 3. A simple cycle of length $n$ in a graph $G$ is an embedding of $C_n$
in $G$: this means a sequence of distinct vertices $v_0 v_1 v_2 \ldots v_{n-1}$, where each
pair $v_i v_{i+1}$ is an edge in $G$, as well as $v_{n-1} v_0$. If we omit the requirement
that the vertices are distinct, but insist on distinct edges instead, we have a
cycle. If we omit both requirements, we get a closed walk; this includes
very non-cyclic-looking walks like the short excursion $uvu$. We will mostly
worry about cycles.² See Figure 10.12.


²Some authors reserve cycle for what we are calling a simple cycle, and use circuit for
cycle.

Figure 10.12: Examples of cycles and closed walks. Top left is a graph. Top
right shows the simple cycle 1253 found in this graph. Bottom left shows
the cycle 124523, which is not simple. Bottom right shows the closed walk
12546523, which uses the 25 edge twice.


Unlike paths, which have endpoints, no vertex in a cycle has a special 
role. 

A graph with no cycles is acyclic. Directed acyclic graphs or DAGs
have the property that their reachability relation $\xrightarrow{*}$ is a partial order; this
is easily proven by showing that if $\xrightarrow{*}$ is not anti-symmetric, then there is a
cycle consisting of the paths between two non-anti-symmetric vertices $u \xrightarrow{*} v$
and $v \xrightarrow{*} u$. Directed acyclic graphs may also be topologically sorted:
their vertices ordered as $v_0, v_1, \ldots, v_{n-1}$, so that if there is an edge from
$v_i$ to $v_j$, then $i < j$. The proof is by induction on $|V|$, with the induction
step setting $v_{n-1}$ to equal some vertex with out-degree 0 and ordering the
remaining vertices recursively. (See §9.5.5.1.)
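
The inductive argument translates almost directly into code: repeatedly pull out a vertex with no outgoing edge to the vertices that remain, and place it last. The following is a deliberately simple quadratic-time sketch; the function name is an assumption, and `sorted` is used only to make the choice of sink deterministic (so vertex labels must be comparable).

```python
def topological_sort(vertices, edges):
    """Order the vertices of a DAG so that every edge goes from earlier to later."""
    remaining = set(vertices)
    order = []
    while remaining:
        # Find a vertex with out-degree 0 among the vertices not yet placed.
        # In an acyclic graph one always exists; if there is a cycle, next()
        # raises StopIteration instead.
        sink = next(v for v in sorted(remaining)
                    if not any(a == v and b in remaining for a, b in edges))
        order.append(sink)           # sinks are collected last-to-first
        remaining.remove(sink)
    order.reverse()
    return order

dag_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
order = topological_sort(range(4), dag_edges)
position = {v: i for i, v in enumerate(order)}
```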

Connected acyclic undirected graphs are called trees. A connected graph
$G = (V, E)$ is a tree if and only if $|E| = |V| - 1$; we’ll prove this and other
characterizations of trees in §10.10.3.

A cycle that includes every edge exactly once is called an Eulerian cycle
or Eulerian tour, after Leonhard Euler, whose study of the Seven Bridges
of Königsberg problem led to the development of graph theory. A cycle
that includes every vertex exactly once is called a Hamiltonian cycle or
Hamiltonian tour, after William Rowan Hamilton, another historical
graph-theory heavyweight (although he is more famous for inventing quaternions and
the Hamiltonian). Graphs with Eulerian cycles have a simple characterization:
a connected graph has an Eulerian cycle if and only if every vertex has even
degree. Graphs with Hamiltonian cycles are harder to recognize.


10.10 Proving things about graphs 


Suppose we want to show that all graphs or perhaps all graphs satisfying 
certain criteria have some property. How do we do this? In the ideal case, 
we can decompose the graph into pieces somehow and use induction on the 
number of vertices or the number of edges. If this doesn’t work, we may 
have to look for some properties of the graph we can exploit to construct an 
explicit proof of what we want. 


10.10.1 Paths and simple paths 


If all we care about is connectivity, we can avoid making distinctions between 
paths and simple paths. 


Lemma 10.10.1. If there is a path from $s$ to $t$ in $G$, there is a simple path
from $s$ to $t$ in $G$.


Proof. By induction on the length of the path. Specifically, we will show
that if there is a path from $s$ to $t$ of length $k$, there is a simple path from $s$
to $t$.

The base case is when $k \le 1$; then the path consists of a single vertex or
of exactly one edge and is simple.

For larger $k$, let $s = v_0 \ldots v_k = t$ be a path in $G$. If this path is simple, we
are done. Otherwise, there exist positions $i < j$ such that $v_i = v_j$. Construct
a new path $v_0 \ldots v_i v_{j+1} \ldots v_k$; this is an s–t path of length less than $k$, so by
the induction hypothesis a simple s–t path exists.
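
The cut-out step in this proof can be carried out in a single left-to-right pass. The sketch below is an iterative variant of the same idea rather than a literal transcription of the induction: whenever a vertex repeats, everything after its first occurrence is dropped. The function name is hypothetical.

```python
def simplify_path(path):
    """Return a simple path with the same endpoints, given a path as a vertex list."""
    result = []       # the simple path built so far
    position = {}     # vertex -> its index in result
    for v in path:
        if v in position:
            # v was seen before: cut out the loop that returned to it.
            cut = position[v]
            for w in result[cut + 1:]:
                del position[w]
            del result[cut + 1:]
        else:
            position[v] = len(result)
            result.append(v)
    return result

simple = simplify_path([0, 1, 2, 1, 3])
```

Here the detour through vertex 2 is removed, leaving the simple path 0, 1, 3.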


The converse of this lemma is trivial: any simple path is also a path. 
Essentially the same argument works for cycles: 


Lemma 10.10.2. If there is a cycle in G, there is a simple cycle in G. 


Proof. As in the previous lemma, we prove that there exists a simple cycle
if there is a cycle of length $k$ for any $k$, by induction on $k$. First observe
that the smallest possible cycle has length 3, since anything shorter either
doesn’t get back to its starting point or violates the no-duplicate edges
requirement. So the base case is $k = 3$, and it’s easy to see that all 3-cycles
are simple. For larger $k$, if $v_0 v_1 \ldots v_{k-1}$ is a k-cycle that is not simple, there
exist $i < j$ with $v_i = v_j$; patch the edges between them out to get a smaller
cycle $v_0 \ldots v_i v_{j+1} \ldots v_{k-1}$. The induction hypothesis does the rest of the
work.


10.10.2 The Handshaking Lemma


This lemma relates the total degree of a graph to the number of edges. The 
intuition is that each edge adds one to the degree of both of its endpoints, 
so the total degree of all vertices is twice the number of edges. 


Lemma 10.10.3. For any graph $G = (V,E)$,

$$\sum_{v \in V} d(v) = 2|E|.$$

Proof. By induction on $m = |E|$. If $m = 0$, $G$ has no edges, and
$\sum_{v \in V} d(v) = \sum_{v \in V} 0 = 0 = 2m$. If $m > 0$, choose some edge $st$ and
let $G' = G - st$ be the subgraph of $G$ obtained by removing $st$. Applying the
induction hypothesis to $G'$,

\begin{align*}
2(m-1) &= \sum_{v \in V} d_{G'}(v) \\
       &= \sum_{v \in V \setminus \{s,t\}} d_{G'}(v) + d_{G'}(s) + d_{G'}(t) \\
       &= \sum_{v \in V \setminus \{s,t\}} d_G(v) + (d_G(s) - 1) + (d_G(t) - 1) \\
       &= \sum_{v \in V} d_G(v) - 2.
\end{align*}

So $\sum_{v \in V} d_G(v) - 2 = 2m - 2$, giving $\sum_{v \in V} d_G(v) = 2m$.


One application of the lemma is that the number of odd-degree vertices 
in a graph is always even (take both sides mod 2). Another, that we’ll 
use below, is that if a graph has very few edges, then it must have some 
low-degree vertices. 
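
A quick numeric check of the lemma and of the parity consequence, on a small made-up graph (the example graph and the helper name `degrees` are assumptions for illustration):

```python
from collections import Counter

def degrees(vertices, edges):
    """Degree of every vertex of an undirected graph given as an edge list."""
    d = Counter({v: 0 for v in vertices})
    for u, v in edges:
        d[u] += 1     # each edge adds one to the degree of both endpoints
        d[v] += 1
    return d

vertices = range(5)
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # vertex 4 is isolated
d = degrees(vertices, edges)
total_degree = sum(d.values())
odd_vertices = [v for v in vertices if d[v] % 2 == 1]
```

The total degree is 8, twice the number of edges, and the odd-degree vertices (2 and 3) come in an even number.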


10.10.3 Characterizations of trees


A tree is defined to be an acyclic connected graph. There are several
equivalent characterizations.

Theorem 10.10.4. A graph is a tree if and only if there is exactly one
simple path between any two distinct vertices.


Proof. A graph G is connected if and only if there is at least one simple path 
between any two distinct vertices. We’ll show that it is acyclic if and only if 
there is at most one simple path between any two distinct vertices. 

First, suppose that $G$ has two distinct simple paths $u = v_1 v_2 \ldots v_k = v$
and $u = v'_1 v'_2 \ldots v'_\ell = v$. Let $i$ be the largest index such that $v_r = v'_r$ for
all $r \le i$; under the assumption that the paths are distinct and simple, we
have $i < \min(k, \ell)$. Let $j > i$ be the smallest index for which $v_j = v'_m$ for
some $m > i$; we know that some such $j$ exists because, if nothing else,
$v_k = v'_\ell$. Let $m$ be the smallest such $m$.

Now construct a cycle $v_i v_{i+1} \ldots v_j v'_{m-1} v'_{m-2} \ldots v'_i = v_i$. This is in fact a
simple cycle, since the $v_r$ are all distinct, the $v'_s$ are all distinct, and if any
$v_r$ with $i < r < j$ equals $v'_s$ with $i < s < m$, then $j$ or $m$ is not minimal. It
follows that if $G$ has two distinct simple paths between the same vertices, it
contains a simple cycle, and is not acyclic.

Conversely, suppose that $G$ is not acyclic, and let $v_1 v_2 \ldots v_k = v_1$ be a
simple cycle in $G$. Then $v_1 v_2$ and $v_2 \ldots v_k$ are both simple paths between
$v_1$ and $v_2$, one of which contains $v_3$ and one of which doesn’t. So if $G$ is
not acyclic, it contains more than one simple path between some pair of
vertices.


An alternative characterization counts the number of edges: we will show
that any graph with fewer than $|V| - 1$ edges is disconnected, and any graph
with more than $|V| - 1$ edges is cyclic. With exactly $|V| - 1$ edges, we will
show that a graph is connected if and only if it is acyclic.

The main trick involves reducing $|V|$ by removing a degree-1 vertex.
The following lemma shows that this does not change whether or not the
graph is connected or acyclic:


Lemma 10.10.5. Let G be a nonempty graph, and let v be a vertex of G 
with d(v) = 1. Let G—v be the induced subgraph of G obtained by deleting v 
and its unique incident edge. Then 


1. G is connected if and only if G — v is connected. 
2. G is acyclic if and only if G—v is acyclic. 


Proof. Let $w$ be $v$’s unique neighbor.
If $G$ is connected, for any two vertices $s$ and $t$, there is a simple s–t path.
If neither $s$ nor $t$ is $v$, this path can’t include $v$, because $w$ would appear
both before and after $v$ in the path, violating simplicity. So for any $s$, $t$ in
$G - v$, there is an s–t path in $G - v$, and $G - v$ is connected.

Conversely, if $G - v$ is connected, then any $s$ and $t$ not equal to $v$
remain connected after adding $vw$, and if $s = v$, for any $t$ there is a path
$w = v_1 \ldots v_k = t$, from which we can construct a path $v v_1 \ldots v_k = t$ from $v$
to $t$. The case $t = v$ is symmetric.

If G contains a cycle, then it contains a simple cycle; this cycle can’t 
include v, so G — v also contains the cycle. 

Conversely, if G — v contains a cycle, this cycle is also in G. 


Because a graph with two vertices and fewer than one edge is not
connected, Lemma 10.10.5 implies that any graph with fewer than $|V| - 1$
edges is not connected.


Corollary 10.10.6. Let G=(V,E). If |E| <|V|—1, G is not connected. 


Proof. By induction on $n = |V|$.

For the base case, if $n = 0$, then $|E| = 0 \not< n - 1$.

For larger $n$, suppose that $n \ge 1$ and $|E| < n - 1$. From Lemma 10.10.3
we have $\sum_v d(v) < 2n - 2$, from which it follows that there must be at least
one vertex $v$ with $d(v) < 2$. If $d(v) = 0$, then $G$ is not connected. If $d(v) = 1$,
then by Lemma 10.10.5 $G$ is connected if and only if $G - v$ is connected. But
$G - v$ has $n - 1$ vertices and $|E| - 1 < n - 2$ edges, so by the induction
hypothesis, $G - v$ is not connected. So in either case, $|E| < n - 1$ implies $G$
is not connected.


In the other direction, combining the lemma with the fact that the unique 
graph K3 with three vertices and at least three edges is cyclic tells us that 
any graph with at least as many edges as vertices is cyclic. 


Corollary 10.10.7. Let G=(V,E). If |E| >|V|—1, G contains a cycle. 


Proof. By induction on $n = |V|$.
For $n \le 2$, $|E| \not> |V| - 1$, so the claim holds vacuously.³
For larger $n$, there are two cases:


1. Some vertex $v$ has degree $d(v) \le 1$. Let $G' = (V', E') = G - v$. Then
$|E'| \ge |E| - 1 > |V| - 2 = |V'| - 1$, and by the induction hypothesis
$G'$ contains a cycle. This cycle is also in $G$.


2. Every vertex $v$ in $G$ has $d(v) \ge 2$. Let’s go for a walk: starting at
some vertex $v_0$, choose at each step a vertex $v_{i+1}$ adjacent to $v_i$ that
does not already appear in the walk. This process finishes when we
reach a node $v_k$, all of whose neighbors appear in the walk in a previous
position. One of these neighbors may be $v_{k-1}$; but since $d(v_k) \ge 2$,
there is another neighbor $v_j \ne v_{k-1}$. So $v_j \ldots v_k v_j$ forms a cycle.


³In fact, no graph with $|V| \le 2$ contains a cycle, but we don’t need to use this.


Now we can prove the full result: 


Theorem 10.10.8. Let $G = (V,E)$ be a nonempty graph. Then any two of
the following statements imply the third:


1. $G$ is connected.

2. $G$ is acyclic.

3. $|E| = |V| - 1$.


Proof. We will use induction on $n = |V|$ for some parts of the proof. The
base case is when $n = 1$; then all three statements always hold. For larger
$n$, we show:


(1) and (2) imply (3): Use Corollary 10.10.6 and Corollary 10.10.7. 


(1) and (3) imply (2). From Lemma 10.10.3, $\sum_{v \in V} d(v) = 2(n-1) < 2n$.
It follows that there is at least one $v$ with $d(v) \le 1$. Because $G$ is
connected, we must have $d(v) = 1$. So $G' = G - v$ is a graph with $n - 2$
edges and $n - 1$ vertices. It is connected by Lemma 10.10.5, and thus
it is acyclic by the induction hypothesis. Applying the acyclicity part of
Lemma 10.10.5 in the other direction shows $G$ is also acyclic.


(2) and (3) imply (1). As in the previous case, $G$ contains a vertex
$v$ with $d(v) \le 1$. If $d(v) = 1$, then $G - v$ is a nonempty graph with
$n - 2$ edges and $n - 1$ vertices that is acyclic by Lemma 10.10.5. It
is thus connected by the induction hypothesis, so $G$ is also connected
by Lemma 10.10.5. If $d(v) = 0$, then $G - v$ has $n - 1$ edges and $n - 1$
vertices. From Corollary 10.10.7, $G - v$ contains a cycle, contradicting (2).


For an alternative proof based on removing edges, see [Big02, Theorem 15.5].


This also gives the useful fact that removing one edge from a tree gives
exactly two components.


10.10.4 Spanning trees


Here’s another induction proof on graphs. A spanning tree of a nonempty
connected graph $G$ is a subgraph of $G$ that includes all vertices and is a tree
(i.e., is connected and acyclic).


Theorem 10.10.9. Every nonempty connected graph has a spanning tree. 


Proof. Let $G = (V,E)$ be a nonempty connected graph. We’ll show by
induction on $|E|$ that $G$ has a spanning tree. The base case is $|E| = |V| - 1$
(the least value for which $G$ can be connected); then $G$ itself is a tree (by
Theorem 10.10.8). For larger $|E|$, the same theorem gives that $G$ contains a
cycle. Let $uv$ be any edge on the cycle, and consider the graph $G - uv$; this
graph is connected (since we can route any path that used to go through $uv$
around the other edges of the cycle) and has fewer edges than $G$, so by the
induction hypothesis there is some spanning tree $T$ of $G - uv$. But then $T$
also spans $G$, so we are done.
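
The proof is constructive, and a direct (if inefficient) rendering of it is to keep deleting an edge whose removal leaves the graph connected until only $|V| - 1$ edges remain. The sketch below does exactly that; the function names are assumptions, the edge list is assumed to have no duplicates, and a practical implementation would use a search tree from BFS or DFS instead.

```python
def is_connected(vertices, edges):
    """Check connectivity of a nonempty undirected graph by depth-first search."""
    vertices = list(vertices)
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        u = stack.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == u and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return len(seen) == len(vertices)

def spanning_tree(vertices, edges):
    """Delete cycle edges from a connected graph until a spanning tree is left."""
    vertices = list(vertices)
    tree = list(edges)
    while len(tree) > len(vertices) - 1:
        for e in tree:
            rest = [f for f in tree if f != e]
            # An edge lies on a cycle exactly when removing it keeps the
            # graph connected, so it is safe to delete.
            if is_connected(vertices, rest):
                tree = rest
                break
        else:
            raise ValueError("input graph is not connected")
    return tree

k4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
t = spanning_tree(range(4), k4)
```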


10.10.5 Eulerian cycles 


Let’s prove the vertex degree characterization of graphs with Eulerian cycles. 
As in the previous proofs, we’ll take the approach of looking for something 
to pull out of the graph to get a smaller case. 


Theorem 10.10.10. Let G be a connected graph. Then G has an Eulerian 
cycle if and only if all nodes have even degree. 


Proof. • (Only if part). Fix some Eulerian cycle, and orient the edges by
the direction that the cycle traverses them. Then in the resulting directed
graph we must have $d^-(u) = d^+(u)$ for all $u$, since every time we enter
a vertex we have to leave it again. But then $d(u) = d^-(u) + d^+(u) = 2d^+(u)$
is even.


• (If part, sketch of proof). Suppose now that $d(u)$ is even for all $u$. We
will construct an Eulerian cycle on all nodes by induction on $|E|$. The
base case is when $|E| = |V|$ and $G = C_{|V|}$. For a larger graph, choose
some starting node $u_1$, and construct a path $u_1 u_2 \ldots$ by choosing
an arbitrary unused edge leaving each $u_i$; this is always possible for
$u_i \ne u_1$ since whenever we reach $u_i$ we have always consumed an even
number of edges on previous visits plus one to get to it this time,
leaving at least one remaining edge to leave on. Since there are only
finitely many edges and we can only use each one once, eventually we
must get stuck, and this must occur with $u_k = u_1$ for some $k$. Now
delete all the edges in $u_1 \ldots u_k$ from $G$, and consider the connected
components of $G - (u_1 \ldots u_k)$. Removing the cycle reduces $d(v)$ by an
even number, so within each such connected component the degree
of all vertices is even. It follows from the induction hypothesis that
each connected component has an Eulerian cycle. We’ll now string
these per-component cycles together using our original cycle: while
traversing $u_1, \ldots, u_k$, when we encounter some component for the first
time, we take a detour around the component’s cycle. The resulting
merged cycle gives an Eulerian cycle for the entire graph.
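
The walk-until-stuck-then-splice idea has a standard iterative form, usually attributed to Hierholzer. The sketch below uses that form rather than transcribing the induction literally; it assumes the input is a connected graph with at least one edge in which every vertex has even degree, and the function name is made up for this example.

```python
def eulerian_cycle(edges):
    """Return an Eulerian cycle as a list of vertices (first and last are equal)."""
    adjacent = {}
    for i, (u, v) in enumerate(edges):
        adjacent.setdefault(u, []).append((v, i))
        adjacent.setdefault(v, []).append((u, i))
    used = [False] * len(edges)
    stack = [edges[0][0]]    # the walk currently being extended
    tour = []
    while stack:
        u = stack[-1]
        # Discard edges at u that were already traversed from the other end.
        while adjacent[u] and used[adjacent[u][-1][1]]:
            adjacent[u].pop()
        if adjacent[u]:
            v, i = adjacent[u].pop()
            used[i] = True
            stack.append(v)            # keep walking along an unused edge
        else:
            tour.append(stack.pop())   # stuck at u: emit it and back up
    return tour

# Two triangles sharing vertex 2; every vertex has even degree.
bowtie = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)]
tour = eulerian_cycle(bowtie)
```

Backing up from a stuck vertex and resuming the walk from an earlier vertex with unused edges plays the role of the detours in the proof.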


Why doesn’t this work for Hamiltonian cycles? The problem is that in a 
Hamiltonian cycle we have too many choices: out of the d(u) edges incident 
to u, we will only use two of them. If we pick the wrong two early on, this 
may prevent us from ever fitting u into a Hamiltonian cycle. So we would 
need some stronger property of our graph to get Hamiltonicity. 


Chapter 11 


Counting 


Counting is the process of creating a bijection between a set we want to
count and some set whose size we already know. Typically this second set
will be a finite ordinal $[n] = \{0,1,\ldots,n-1\}$.¹

Counting a set $A$ using a bijection $f : A \to [n]$ gives its size $|A| = n$;
this size is called the cardinality of $A$. As a side effect, it also gives a
well-ordering of $A$: since $[n]$ is well-ordered, we can define $x \le y$ for $x, y$ in
$A$ by $x \le y$ if and only if $f(x) \le f(y)$. Often the quickest way to find $f$ is to
line up all the elements of $A$ in a well-ordering and then count them off: the
smallest element of $A$ gets mapped to 0, the next smallest to 1, and so on.
Stripped of the mathematical jargon, this is exactly what you were taught
to do as a small child.

Usually we will not provide an explicit bijection to compute the size of a 
set, but instead will rely on standard counting principles based on how we 
constructed the set. The branch of mathematics that studies sets constructed 
by combining other sets is called combinatorics, and the sub-branch that 
counts these sets is called enumerative combinatorics. In this chapter, 
we’re going to give an introduction to enumerative combinatorics, but this 
basically just means counting. 

For infinite sets, cardinality is a little more complicated. The basic idea
is that we define $|A| = |B|$ if there is a bijection between them. This gives an
equivalence relation on sets,² and we define $|A|$ to be the equivalence class
of this equivalence relation that contains $A$. For the finite case we represent
the equivalence classes by taking representative elements $[n]$.

For the most part we will concentrate on counting finite sets, but will
mention where the rules for finite sets break down with infinite sets.


¹Starting from 0 is traditional in computer science, because it makes indexing easier.
Normal people count to n using $\{1,2,\ldots,n\}$.

²Reflexivity: the identity function is a bijection from $A$ to $A$. Symmetry: if $f : A \to B$
is a bijection, so is $f^{-1} : B \to A$. Transitivity: if $f : A \to B$ and $g : B \to C$ are bijections,
so is $(g \circ f) : A \to C$.


11.1 Basic counting techniques 


Our goal here is to compute the size of some set of objects, e.g., the number 
of subsets of a set of size n, the number of ways to put k cats into n boxes 
so that no box gets more than one cat, etc. 

In rare cases we can use the definition of the size of a set directly, by 
constructing a bijection between the set we care about and some canonical set 
[n]. For example, the set $S_n = \{x \in \mathbb{N} \mid x < n^2 \wedge \exists y : x = y^2\}$ has exactly $n$
members, because we can generate it by applying the one-to-one correspondence
$f(y) = y^2$ to the set $\{0,1,2,3,\ldots,n-1\} = [n]$. But most of the time
constructing an explicit one-to-one correspondence is too time-consuming 
or too hard, so instead we will show how to map set-theoretic operations to 
arithmetic operations, so that from a set-theoretic construction of a set we 
can often directly read off an arithmetic computation that gives the size of 
the set. 
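
The example set is small enough to enumerate, which gives a sanity check on the count. The sketch below builds $S_n$ straight from its definition and, separately, as the image of $[n]$ under $f(y) = y^2$; the function names are assumptions for this illustration.

```python
from math import isqrt

def squares_below(n):
    """S_n: the natural numbers x < n^2 that are perfect squares."""
    return {x for x in range(n * n) if isqrt(x) ** 2 == x}

def image_of_f(n):
    """Image of [n] = {0, ..., n-1} under the correspondence f(y) = y^2."""
    return {y * y for y in range(n)}

s4 = squares_below(4)
```

Both constructions give the same set, and its size is $n$.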


11.1.1 Equality: reducing to a previously-solved case 


If we can produce a bijection between a set A whose size we don’t know and 
a set B whose size we do, then we get |A| = |B|. Pretty much all of our 
proofs of cardinality will end up looking like this. 


11.1.2 Inequalities: showing $|A| \le |B|$ and $|B| \le |A|$


We write $|A| \le |B|$ if there is an injection $f : A \to B$, and similarly $|B| \le |A|$
if there is an injection $g : B \to A$. If both conditions hold, then there is a
bijection between $A$ and $B$, showing $|A| = |B|$. This fact is trivial for finite
sets, but for infinite sets (even though it is still true) the actual construction
of the bijection is a little trickier.³


³The claim for general sets is known as the Cantor-Bernstein-Schroeder theorem. One
way to prove this is to assume that $A$ and $B$ are disjoint and construct a (not necessarily
finite) graph whose vertex set is $A \cup B$ and that has edges for all pairs $(a, f(a))$ and
$(b, g(b))$. It can then be shown that the connected components of this graph consist of (1)
finite cycles, (2) doubly-infinite paths (i.e., paths with no endpoint in either direction), (3)
infinite paths with an initial vertex in $A$, and (4) infinite paths with an initial vertex in $B$.
For vertices in all but the last class of components, define $h(x)$ to be $f(x)$ if $x$ is in $A$ and
$f^{-1}(x)$ if $x$ is in $B$. (Note that we are abusing notation slightly here by defining $f^{-1}(x)$

Similarly, if we write $|A| \ge |B|$ to indicate that there is a surjection from
$A$ to $B$, then $|A| \ge |B|$ and $|B| \ge |A|$ implies $|A| = |B|$. The easiest way to
show this is to observe that if there is a surjection $f : A \to B$, then we can
get an injection $f' : B \to A$ by letting $f'(y)$ be any element of $\{x \mid f(x) = y\}$,
thus reducing to the previous case (this requires the Axiom of Choice, but
pretty much everybody assumes the Axiom of Choice). Showing an injection
$f : A \to B$ and a surjection $g : A \to B$ also works.

For example, $|\mathbb{Q}| = |\mathbb{N}|$. Proof: $|\mathbb{N}| \le |\mathbb{Q}|$ because we can map any $n$ in
$\mathbb{N}$ to the same value in $\mathbb{Q}$; this is clearly an injection. To show $|\mathbb{Q}| \le |\mathbb{N}|$,
observe that we can encode any element $\pm p/q$ of $\mathbb{Q}$, where $p$ and $q$ are both
natural numbers and $p/q$ is in lowest terms, as a triple $(s,p,q)$ where
$s \in \{0,1\}$ indicates + (0) or − (1); this encoding is clearly injective. Then
use the Cantor pairing function (§3.7.1) twice to crunch this triple down to
a single natural number, getting an injection from $\mathbb{Q}$ to $\mathbb{N}$.
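
The encoding can be tried out concretely. The sketch below assumes one common form of the Cantor pairing function, $(x, y) \mapsto (x+y)(x+y+1)/2 + y$, which may differ in detail from the one in §3.7.1, and uses Python's `Fraction` type to put each rational in lowest terms. The names `pair` and `encode` are hypothetical.

```python
from fractions import Fraction

def pair(x, y):
    """One standard form of the Cantor pairing function on natural numbers."""
    return (x + y) * (x + y + 1) // 2 + y

def encode(r):
    """Map a rational to a natural number via the triple (s, p, q)."""
    r = Fraction(r)                  # Fraction stores p/q in lowest terms
    s = 1 if r < 0 else 0            # sign bit: 0 for +, 1 for -
    p, q = abs(r.numerator), r.denominator
    return pair(s, pair(p, q))       # apply the pairing function twice

sample = {Fraction(p, q) for p in range(-5, 6) for q in range(1, 6)}
codes = {encode(r) for r in sample}
```

Distinct rationals in the sample receive distinct codes, as expected of an injection.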


11.1.3 Addition: the sum rule 


The sum rule computes the size of $A \cup B$ when $A$ and $B$ are disjoint.


Theorem 11.1.1. If $A$ and $B$ are finite sets with $A \cap B = \emptyset$, then
$|A \cup B| = |A| + |B|$.


Proof. Let $f : A \to [|A|]$ and $g : B \to [|B|]$ be bijections. Define
$h : A \cup B \to [|A| + |B|]$ by the rule $h(x) = f(x)$ for $x \in A$, $h(x) = |A| + g(x)$
for $x \in B$.

To show that this is a bijection, define $h^{-1}(y)$ for $y$ in $[|A| + |B|]$ to be
$f^{-1}(y)$ if $y < |A|$ and $g^{-1}(y - |A|)$ otherwise. Then for any $y$ in $[|A| + |B|]$,
either


1. $0 \le y < |A|$, $y$ is in the codomain of $f$ (so $h^{-1}(y) = f^{-1}(y) \in A$ is
well-defined), and $h(h^{-1}(y)) = f(f^{-1}(y)) = y$.


2. $|A| \le y < |A| + |B|$. In this case $0 \le y - |A| < |B|$, putting $y - |A|$ in
the codomain of $g$ and giving $h(h^{-1}(y)) = g(g^{-1}(y - |A|)) + |A| = y$.


So $h^{-1}$ is in fact an inverse of $h$, meaning that $h$ is a bijection.


³(continued) to be the unique $y$ that maps to $x$ when it exists.) For the last class of
components, the initial $B$ vertex is not the image of any $x$ under $f$; so for these we define
$h(x)$ to be $g(x)$ if $x$ is in $B$ and $g^{-1}(x)$ if $x$ is in $A$. This gives the desired bijection $h$
between $A$ and $B$.

In the case where $A$ and $B$ are not disjoint, we can make them disjoint by replacing
them with $A' = \{0\} \times A$ and $B' = \{1\} \times B$. (This is a pretty common trick for enforcing
disjoint unions.)

One way to think about this proof is that we are constructing a total
order on $A \cup B$ by putting all the $A$ elements before all the $B$ elements. This
gives a straightforward bijection with $[|A| + |B|]$ by the usual preschool trick
of counting things off in order.

Generalizations: If $A_1, A_2, A_3, \ldots, A_k$ are pairwise disjoint (i.e., $A_i \cap
A_j = \emptyset$ for all $i \ne j$), then

\[
\left| \bigcup_{i=1}^{k} A_i \right| = \sum_{i=1}^{k} |A_i|.
\]


The proof is by induction on k. 

Example: As I was going to Saint Ives, I met a man with 7 wives, 28 
children, 56 grandchildren, and 122 great-grandchildren. Assuming these sets 
do not overlap, how many people did I meet? Answer: 1+7+28+56+122=214. 


11.1.3.1 For infinite sets 


The sum rule works for infinite sets, too; technically, the sum rule is used
to define $|A| + |B|$ as $|A \cup B|$ when $A$ and $B$ are disjoint. This makes
cardinal arithmetic a bit wonky: if at least one of $A$ and $B$ is infinite, then
$|A| + |B| = \max(|A|,|B|)$, since we can space out the elements of the larger
of $A$ and $B$ and shove the elements of the other into the gaps.


11.1.3.2 The Pigeonhole Principle 


A consequence of the sum rule is that if $A$ and $B$ are both finite and
$|A| > |B|$, you can’t have an injection from $A$ to $B$. The proof is by
contraposition. Suppose $f : A \to B$ is an injection. Write $A$ as the union
of $f^{-1}(x)$ for each $x \in B$, where $f^{-1}(x)$ is the set of $y$ in $A$ that map to
$x$. Because the sets $f^{-1}(x)$ are pairwise disjoint, the sum rule applies; but because $f$ is
an injection there is at most one element in each $f^{-1}(x)$. It follows that
$|A| = \sum_{x \in B} |f^{-1}(x)| \le \sum_{x \in B} 1 = |B|$. (Question: Why doesn’t this work for
infinite sets?)
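
For small sets the principle can be checked exhaustively. The sketch below, with sizes 5 and 3 chosen arbitrarily, enumerates every function from $A$ to $B$ and records the size of its largest preimage $f^{-1}(x)$; `largest_fiber` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product

def largest_fiber(f):
    """Return the maximum over x of |f^{-1}(x)| for a function given as a dict."""
    return max(Counter(f.values()).values())

A, B = range(5), range(3)          # |A| = 5 > |B| = 3
# Enumerate every function A -> B as a tuple of outputs.
all_functions = [dict(zip(A, outputs)) for outputs in product(B, repeat=len(A))]
injections = [f for f in all_functions if largest_fiber(f) == 1]
smallest_max_fiber = min(largest_fiber(f) for f in all_functions)
```

No function in the list is injective, and every one of them sends at least two elements of $A$ to the same element of $B$.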

The Pigeonhole Principle generalizes in an obvious way to functions
with larger domains; if $f : A \to B$, then there is some $x$ in $B$ such that
$|f^{-1}(x)| \ge |A|/|B|$.


11.1.4 Subtraction


For any sets $A$ and $B$, $A$ is the disjoint union of $A \cap B$ and $A \setminus B$. So
$|A| = |A \cap B| + |A \setminus B|$ (for finite sets) by the sum rule. Rearranging gives

\[
|A \setminus B| = |A| - |A \cap B|. \tag{11.1.1}
\]


What makes (11.1.1) particularly useful is that we can use it to compute
the size of $A \cup B$ even if $A$ and $B$ overlap. The intuition is that if we just
add $|A|$ and $|B|$, then we count every element of $A \cap B$ twice; by subtracting
off $|A \cap B|$ we eliminate the overcount. Formally, we have


Theorem 11.1.2. For any finite sets $A$ and $B$,
$|A \cup B| = |A| + |B| - |A \cap B|$.

Proof. Compute

\begin{align*}
|A \cup B| &= |A \cap B| + |A \setminus B| + |B \setminus A| \\
&= |A \cap B| + (|A| - |A \cap B|) + (|B| - |A \cap B|) \\
&= |A| + |B| - |A \cap B|.
\end{align*}


This is a special case of the inclusion-exclusion formula, which can 
be used to compute the size of the union of many sets using the size of 
pairwise, triple-wise, etc. intersections of the sets. See §11.2.4 for the general 
rule. 


11.1.4.1 Inclusion-exclusion for infinite sets 


Subtraction doesn’t work very well for infinite quantities (while $\aleph_0 + \aleph_0 = \aleph_0$,
that doesn’t mean $\aleph_0 = 0$). So the closest we can get to the inclusion-exclusion
formula is that $|A| + |B| = |A \cup B| + |A \cap B|$. If at least one of $A$
or $B$ is infinite, then $|A \cup B|$ is also infinite, and since $|A \cap B| \le |A \cup B|$ we
have $|A \cup B| + |A \cap B| = |A \cup B|$ by the bizarre rules of cardinal arithmetic.
So for infinite sets we have the rather odd result that $|A \cup B| = |A| + |B| =
\max(|A|,|B|)$ whether the sets overlap or not.


11.1.4.2 Combinatorial proof


We can prove $|A| + |B| = |A \cup B| + |A \cap B|$ combinatorially, by turning both
sides of the equation into disjoint unions (so the sum rule works) and then
providing an explicit bijection between the resulting sets. The trick is that
we can always force a union to be disjoint by tagging the elements with extra
information; so on the left-hand side we construct $L = (\{0\} \times A) \cup (\{1\} \times B)$,
and on the right-hand side we construct $R = (\{0\} \times (A \cup B)) \cup (\{1\} \times (A \cap B))$.
It is easy to see that both unions are disjoint, because we are always taking
the union of a set of ordered pairs that start with 0 with a set of ordered
pairs that start with 1, and no ordered pair can start with both tags; it
follows that $|L| = |A| + |B|$ and $|R| = |A \cup B| + |A \cap B|$. Now define the
function $f : L \to R$ by the rule

\begin{align*}
f((0,x)) &= (0,x). \\
f((1,x)) &= (1,x) \text{ if } x \in B \cap A. \\
f((1,x)) &= (0,x) \text{ if } x \in B \setminus A.
\end{align*}

Observe that $f$ is surjective, because for any $(0,x)$ in $\{0\} \times (A \cup B)$,
either $x$ is in $A$ and $(0,x) = f((0,x))$ where $(0,x) \in L$, or $x$ is in $B \setminus A$ and
$(0,x) = f((1,x))$ where $(1,x) \in L$; and any $(1,x)$ in $\{1\} \times (A \cap B)$ equals
$f((1,x))$. It is also true that $f$ is injective; the only
way for it not to be is if $f((0,x)) = f((1,x)) = (0,x)$ for some $x$. Suppose
this occurs. Then $x \in A$ (because of the 0 tag) and $x \in B \setminus A$ (because $(1,x)$
is only mapped to $(0,x)$ if $x \in B \setminus A$). But $x$ can’t be in both $A$ and $B \setminus A$,
so we get a contradiction.
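
To see the tagging argument in action, here is a small sketch that builds $L$, $R$, and $f$ for concrete sets. The function name `tagged_bijection` and the example sets are assumptions chosen for illustration.

```python
def tagged_bijection(A, B):
    """Return (L, R, f) for the tagging construction described above."""
    A, B = set(A), set(B)
    L = {(0, x) for x in A} | {(1, x) for x in B}
    R = {(0, x) for x in A | B} | {(1, x) for x in A & B}

    def f(pair):
        tag, x = pair
        if tag == 0:
            return (0, x)                      # elements tagged as coming from A
        return (1, x) if x in A else (0, x)    # B ∩ A keeps tag 1, B \ A gets tag 0

    return L, R, f

L, R, f = tagged_bijection({1, 2, 3, 4}, {3, 4, 5})
image = {f(p) for p in L}
```

Here $|L| = 4 + 3$ and $|R| = 5 + 2$, and the image of $f$ covers all of $R$ without collisions.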


11.1.5 Multiplication: the product rule 


The product rule says that Cartesian product maps to arithmetic product.
Intuitively, we line the elements $(a,b)$ of $A \times B$ in lexicographic order and
count them off. This looks very much like packing a two-dimensional array
in a one-dimensional array by mapping each pair of indices $(i,j)$ to $i \cdot |B| + j$.


Theorem 11.1.3. For any finite sets $A$ and $B$,
$|A \times B| = |A| \cdot |B|$.


Proof. The trick is to order $A \times B$ lexicographically and then count off
the elements. Given bijections $f : A \to [|A|]$ and $g : B \to [|B|]$, define
$h : (A \times B) \to [|A| \cdot |B|]$ by the rule $h((a,b)) = f(a) \cdot |B| + g(b)$. The division
algorithm recovers $a$ and $b$ from $h((a,b))$ by recovering the unique natural
numbers $q$ and $r$ such that $h((a,b)) = q \cdot |B| + r$ and $0 \le r < |B|$ and letting
$a = f^{-1}(q)$ and $b = g^{-1}(r)$.
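
The array-packing view of this proof translates almost directly into code. In the sketch below, list positions play the role of the bijections $f$ and $g$ (an assumption made for convenience), and Python's built-in `divmod` plays the role of the division algorithm.

```python
def product_rule_bijection(A, B):
    """Return (h, h_inv) pairing A x B with {0, ..., |A|*|B| - 1}."""
    A, B = list(A), list(B)
    f = {a: i for i, a in enumerate(A)}   # bijection A -> [|A|]
    g = {b: j for j, b in enumerate(B)}   # bijection B -> [|B|]

    def h(a, b):
        return f[a] * len(B) + g[b]

    def h_inv(y):
        q, r = divmod(y, len(B))          # division algorithm: y = q*|B| + r
        return A[q], B[r]

    return h, h_inv

A, B = ["x", "y", "z"], [10, 20]
h, h_inv = product_rule_bijection(A, B)
codes = sorted(h(a, b) for a in A for b in B)
```

The six pairs should receive the codes 0 through 5, and `h_inv` should undo `h`.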


The general form is

\[
\left| \prod_{i=1}^{k} A_i \right| = \prod_{i=1}^{k} |A_i|,
\]

where the product on the left is a Cartesian product and the product on
the right is an ordinary integer product.


11.1.5.1 Examples 


• As I was going to Saint Ives, I met a man with seven sacks, and every
sack had seven cats. How many cats total? Answer: Label the sacks
$0,1,2,\ldots,6$, and label the cats in each sack $0,1,2,\ldots,6$. Then each cat
can be specified uniquely by giving a pair (sack number, cat number),
giving a bijection between the set of cats and the set $7 \times 7$. Since
$|7 \times 7| = 7 \cdot 7 = 49$, we have 49 cats.


• Dr. Frankenstein’s trusty assistant Igor has brought him 6 torsos, 4
brains, 8 pairs of matching arms, and 4 pairs of legs. How many
different monsters can Dr. Frankenstein build? Answer: there is a one-to-one
correspondence between possible monsters and 4-tuples of the
form (torso, brain, pair of arms, pair of legs); the set of such 4-tuples
has $6 \cdot 4 \cdot 8 \cdot 4 = 768$ members.


• How many different ways can you order $n$ items? Call this quantity
$n!$ (pronounced “n factorial”). With 0 or 1 items, there is only one
way; so we have $0! = 1! = 1$. For $n > 1$, there are $n$ choices for the
first item, leaving $n - 1$ items to be ordered. From the product rule
we thus have $n! = n \cdot (n-1)!$, which we can expand out as $\prod_{i=1}^{n} i$, our
previous definition of $n!$.
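
Counts like these are easy to confirm by brute force. The sketch below uses `itertools` from the Python standard library; the part labels are hypothetical stand-ins, and only the counts 6, 4, 8, and 4 come from the example.

```python
from itertools import permutations, product
from math import factorial

# Hypothetical labels for Igor's parts; only the counts come from the text.
torsos, brains, arms, legs = range(6), range(4), range(8), range(4)
monsters = list(product(torsos, brains, arms, legs))   # all 4-tuples

# Orderings of n items, counted by brute force for small n.
orderings = {n: sum(1 for _ in permutations(range(n))) for n in range(6)}
```

The number of 4-tuples should match the product of the four counts, and each entry of `orderings` should equal $n!$.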


11.1.5.2 For infinite sets 


The product rule also works for infinite sets, because we again use it as a
definition: for any $A$ and $B$, $|A| \cdot |B|$ is defined to be $|A \times B|$. One oddity for
infinite sets is that this definition gives $|A| \cdot |B| = |A| + |B| = \max(|A|,|B|)$,
because if at least one of $A$ and $B$ is infinite (and neither is empty), it is possible to construct a
bijection between $A \times B$ and the larger of $A$ and $B$. Infinite sets are strange.


11.1.6 Exponentiation: the exponent rule


Given sets $A$ and $B$, let $A^B$ be the set of functions $f : B \to A$. Then
$|A^B| = |A|^{|B|}$.

If $|B|$ is finite, this is just a $|B|$-fold application of the product rule: we
can write any function $f : B \to A$ as a sequence of length $|B|$ that gives
the value in $A$ for each input in $B$. Since each element of the sequence
contributes $|A|$ possible choices, we get $|A|^{|B|}$ choices total.

For infinite sets, the exponent rule is a definition of $|A|^{|B|}$. Some simple
facts are that $n^\alpha = 2^\alpha$ whenever $n \ge 2$ is finite and $\alpha$ is infinite (this comes down
to the fact that we can represent any element of $[n]$ as a finite sequence of
bits) and $\alpha^n = \alpha$ under the same conditions (follows by induction on $n$ from
$\alpha \cdot \alpha = \alpha$). When $\alpha$ and $\beta$ are both infinite, many strange things can happen.

To give a flavor of how exponentiation works for arbitrary sets, here’s
a combinatorial proof of the usual arithmetic fact that $x^a x^b = x^{a+b}$, for
any cardinal numbers $x$, $a$, and $b$. Let $x = |X|$ and let $a = |A|$ and
$b = |B|$ where $A$ and $B$ are disjoint (we can always use the tagging trick
that we used for inclusion-exclusion to make $A$ and $B$ be disjoint). Then
$x^a x^b = |X^A \times X^B|$ and $x^{a+b} = |X^{A \cup B}|$. We will now construct an explicit
bijection $f : X^{A \cup B} \to X^A \times X^B$. The input to $f$ is a function $g : A \cup B \to X$;
the output is a pair of functions $(g_A : A \to X, g_B : B \to X)$. We define $g_A$
by $g_A(x) = g(x)$ for all $x$ in $A$ (this makes $g_A$ the restriction of $g$ to $A$,
usually written as $g \restriction A$ or $g|A$); similarly $g_B = g \restriction B$. This is easily seen to
be a bijection; if $g = h$, then $f(g) = (g \restriction A, g \restriction B) = f(h) = (h \restriction A, h \restriction B)$,
and if $g \ne h$ there is some $x$ for which $g(x) \ne h(x)$, implying $g \restriction A \ne h \restriction A$
(if $x$ is in $A$) or $g \restriction B \ne h \restriction B$ (if $x$ is in $B$). Finally, any pair $(g_A, g_B)$ is $f(g)$
for the function $g$ that agrees with $g_A$ on $A$ and with $g_B$ on $B$, which is
well-defined because $A$ and $B$ are disjoint.


11.1.6.1 Counting injections 


Counting injections from a $k$-element set to an $n$-element set corresponds
to counting the number of ways $P(n,k)$ we can pick an ordered subset of $k$
of $n$ items without replacement, also known as picking a $k$-permutation.
(The $k$ elements of the domain correspond to the $k$ positions in the order.)

There are $n$ ways to pick the first item, $n - 1$ to pick the second, and so
forth, giving a total of

\[
P(n,k) = \prod_{i=n-k+1}^{n} i = \frac{n!}{(n-k)!}
\]

such $k$-permutations by the product rule.

Among combinatorialists, the notation $(n)_k$ (pronounced “n lower-factorial
k”) is more common than $P(n,k)$ for $n \cdot (n-1) \cdot (n-2) \cdots (n-k+1)$.

As an extreme case we have $(n)_n = n \cdot (n-1) \cdot (n-2) \cdots (n-n+1) =
n \cdot (n-1) \cdot (n-2) \cdots 1 = n!$, so $n!$ counts the number of permutations
of $n$.


This gives us three tools for counting functions between sets: $n^k$ counts
the number of functions from a $k$-element set to an $n$-element set, $(n)_k$ counts
the number of injections from a $k$-element set to an $n$-element set, and $n!$
counts the number of bijections between two $n$-element sets (or from an
$n$-element set to itself).

Counting surjections is messier. If you really need to do this, you will
need to use Stirling numbers; see [GKP94, Chapter 6] or [Sta97, p. 33].
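
The three counting tools can be sanity-checked by enumerating functions directly. This is a sketch only: `count_functions` is a hypothetical helper, and `math.perm` (available in Python 3.8 and later) computes the lower factorial $(n)_k$.

```python
from itertools import product
from math import factorial, perm

def count_functions(k, n):
    """Brute-force counts of all functions and of injections from [k] to [n]."""
    total = injective = 0
    for outputs in product(range(n), repeat=k):   # outputs[i] is the image of i
        total += 1
        if len(set(outputs)) == k:                # no two inputs share an output
            injective += 1
    return total, injective

total, injective = count_functions(3, 5)
bijections = count_functions(4, 4)[1]
```

With $k = 3$ and $n = 5$ we expect $5^3$ functions and $(5)_3 = 60$ injections; with $k = n = 4$ the injections are exactly the $4!$ bijections.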


11.1.7 Division: counting the same thing in two different 
ways 


An old farm joke: 
Q: How do you count a herd of cattle? 
A: Count the legs and divide by four. 


Sometimes we can compute the size of a set $S$ by using it (as an unknown
variable) to compute the size of another set $T$ (as a function of $|S|$), and
then using some other way to count $T$ to find its size, finally solving for $|S|$.
This is known as counting two ways and is surprisingly useful when it 
works. We will assume that all the sets we are dealing with are finite, so we 
can expect things like subtraction and division to work properly. 


11.1.7.1 Binomial coefficients 


Let $S$ be an $n$-element set, and let $S_k$ be the set of its $k$-element subsets.
What is $|S_k|$? Answer: First we’ll count the number $m$ of sequences of $k$
elements of $S$ with no repetitions. We can get such a sequence in two ways:


1. By picking a size-$k$ subset $A$ and then choosing one of $k!$ ways to order
the elements. This gives $m = |S_k| \cdot k!$.


2. By choosing the first element in one of $n$ ways, the second in one
of $n - 1$, the third in one of $n - 2$ ways, and so on until the $k$-th
element, which can be chosen in one of $n - k + 1$ ways. This gives
$m = (n)_k = n \cdot (n-1) \cdot (n-2) \cdots (n-k+1)$, which can be written
as $n!/(n-k)!$. (Here we are using the factors in $(n-k)!$ to cancel out
the factors in $n!$ that we don’t want.)


So we have $m = |S_k| \cdot k! = n!/(n-k)!$, from which we get

\[
|S_k| = \frac{n!}{k! \cdot (n-k)!}.
\]

This quantity turns out to be so useful that it has a special notation:

\[
\binom{n}{k} \stackrel{\text{def}}{=} \frac{n!}{k! \cdot (n-k)!},
\]


where the left-hand side is known as a binomial coefficient and is 
pronounced “n choose k.” We discuss binomial coefficients at length in §11.2. 
The secret of why it’s called a binomial coefficient will be revealed when we 
talk about generating functions in §11.3. 
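
The two-ways count can be replayed on a small case. The values $n = 6$ and $k = 3$ below are arbitrary choices for illustration; the sketch counts the repetition-free sequences and the $k$-element subsets separately and lets us compare.

```python
from itertools import combinations, permutations
from math import factorial

n, k = 6, 3
S = range(n)
# m counted directly: sequences of k distinct elements of S.
m = sum(1 for _ in permutations(S, k))
# The number of k-element subsets of S, also counted directly.
num_subsets = sum(1 for _ in combinations(S, k))
```

Both descriptions of $m$ should agree: `num_subsets * k!` and `n! / (n-k)!`.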


11.1.7.2 Multinomial coefficients


Here’s a generalization of binomial coefficients: let the multinomial coefficient

\[
\binom{n}{n_1 \; n_2 \; \ldots \; n_k}
\]

be the number of different ways to distribute $n$ items among $k$ bins
where the $i$-th bin gets exactly $n_i$ of the items and we don’t care what
order the items appear in each bin. (Obviously this only makes sense if
$n_1 + n_2 + \cdots + n_k = n$.) Can we find a simple formula for the multinomial
coefficient?

Here are two ways to count the number of permutations of the $n$-element
set:


1. Pick the first element, then the second, etc., to get $n!$ permutations.
2. Generate a permutation in three steps:

(a) Pick a partition of the $n$ elements into blocks of size $n_1, n_2, \ldots, n_k$.
(b) Order the elements of each block.
(c) Paste the blocks together into a single ordered list.

There are

\[
\binom{n}{n_1 \; n_2 \; \ldots \; n_k}
\]

ways to pick the partition and

\[
n_1! \cdot n_2! \cdots n_k!
\]

ways to order the elements of all the groups, so we have

\[
n! = \binom{n}{n_1 \; n_2 \; \ldots \; n_k} \cdot n_1! \cdot n_2! \cdots n_k!,
\]

which we can solve to get

\[
\binom{n}{n_1 \; n_2 \; \ldots \; n_k} = \frac{n!}{n_1! \cdot n_2! \cdots n_k!}.
\]

This also gives another way to derive the formula for a binomial coefficient,
since

\[
\binom{n}{k} = \binom{n}{k \;\; (n-k)} = \frac{n!}{k! \cdot (n-k)!}.
\]

11.1.8 Applying the rules 


If you’re given some strange set to count, look at the structure of its descrip- 
tion: 


• If it’s given by a rule of the form $x$ is in $S$ if either $P(x)$ or $Q(x)$ is
true, use the sum rule (if $P$ and $Q$ are mutually exclusive) or inclusion-exclusion.
This includes sets given by recursive definitions, e.g. $x$ is a
tree of depth at most $k$ if it is either (a) a single leaf node (provided
$k > 0$) or (b) a root node with two subtrees of depth at most $k - 1$.
The two classes are disjoint so we have $T(k) = 1 + T(k-1)^2$ with
$T(0) = 0$.⁴


⁴ Of course, just setting up a recurrence doesn’t mean it’s going to be easy to actually
solve it.


• For objects made out of many small components or resulting from
many small decisions, try to reduce the description of the object to
something previously known, e.g. (a) a word of length $k$ of letters from
an alphabet of size $n$ allowing repetition (there are $n^k$ of them, by the
product rule); (b) a word of length $k$ not allowing repetition (there
are $(n)_k$ of them, or $n!$ if $n = k$); (c) a subset of $k$ distinct things
from a set of size $n$, where we don’t care about the order (there are $\binom{n}{k}$
of them); (d) any subset of a set of $n$ things (there are $2^n$ of them; this
is a special case of (a), where the alphabet encodes non-membership
as 0 and membership as 1, and the position in the word specifies the
element). Some examples:


– The number of games of Tic-Tac-Toe assuming both players keep
playing until the board is filled is obtained by observing that each
such game can be specified by listing which of the 9 squares are
filled in order, giving $9! = 362880$ distinct games. Note that we
don’t have to worry about which of the 9 moves are made by X
and which by O, since the rules of the game enforce it. (If we
only consider games that end when one player wins, this doesn’t
work: probably the easiest way to count such games is to send a
computer off to generate all of them. This gives 255168 possible
games and 958 distinct final positions.)


– The number of completely-filled-in Tic-Tac-Toe boards can be
obtained by observing that any such board has 5 X’s and 4 O’s.
So there are $\binom{9}{5} = 126$ such positions. (Question: Why would this
be smaller than the actual number of final positions?)
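
Sending a computer off to generate all the games is a short exercise. The sketch below is one way to do it, assuming X moves first and a game stops at the first three-in-a-row or when the board is full; the board representation and helper names are our own choices. It also evaluates the tree recurrence $T(k) = 1 + T(k-1)^2$ with $T(0) = 0$.

```python
from math import comb

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def has_win(board):
    """True if some line holds three equal non-empty marks."""
    return any(board[a] != " " and board[a] == board[b] == board[c]
               for a, b, c in LINES)

def play(board, player, finals):
    """Count games from this position, stopping at a win or a full board."""
    if has_win(board) or " " not in board:
        finals.add("".join(board))             # record the final position
        return 1
    games = 0
    for square in range(9):
        if board[square] == " ":
            board[square] = player
            games += play(board, "O" if player == "X" else "X", finals)
            board[square] = " "                # undo the move
    return games

final_positions = set()
total_games = play([" "] * 9, "X", final_positions)
full_boards = comb(9, 5)     # boards with 5 X's and 4 O's, ignoring legality

def T(k):
    """Trees of depth at most k, using T(k) = 1 + T(k-1)^2 with T(0) = 0."""
    return 0 if k == 0 else 1 + T(k - 1) ** 2
```

The enumeration visits a few hundred thousand positions, so it finishes quickly; the recurrence grows fast, giving 0, 1, 2, 5, 26 for $k = 0, \ldots, 4$.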


Sometimes reducing to a previous case requires creativity. For example,
suppose you win $n$ identical cars on a game show and want to divide them
among your $k$ greedy relatives. Assuming that you don’t care about fairness,
how many ways are there to do this?


• If it’s OK if some people don’t get a car at all, then you can imagine
putting $n$ cars and $k - 1$ dividers in a line, where relative 1 gets all
the cars up to the first divider, relative 2 gets all the cars between the
first and second dividers, and so forth up to relative $k$ who gets all
the cars after the $(k-1)$-th divider. Assume that each car (and each
divider) takes one parking space. Then you have $n + k - 1$ parking
spaces with $k - 1$ dividers in them (and cars in the rest). There are
exactly $\binom{n+k-1}{k-1}$ ways to do this.

• Alternatively, suppose each relative demands at least 1 car. Then you
can just hand out one car to each relative to start with, leaving $n - k$
cars to divide as in the previous case. There are $\binom{(n-k)+k-1}{k-1} = \binom{n-1}{k-1}$
ways to do this.


As always, whenever some counting problem turns out to have an easier
answer than expected, it’s worth trying to figure out if there is a more
direct combinatorial proof. In this case we want to encode assignments
of at least one of $n$ cars to $k$ people, so that this corresponds to picking
$k - 1$ out of $n - 1$ things. One way to do this is to imagine lining up
all $n$ cars, putting each relative in front of one of the cars, and giving
them that car plus any car to the right until we hit the next relative.
In order for this to assign all the cars, we have to put the leftmost
relative in front of the leftmost car. This leaves $n - 1$ places for the
$k - 1$ remaining relatives, giving $\binom{n-1}{k-1}$ choices.
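
Both car-distribution counts can be checked by listing every possible share vector. The helper `distributions` is hypothetical, and its `minimum` parameter selects between the two cases (0 cars allowed, or at least 1 car each).

```python
from itertools import product
from math import comb

def distributions(n, k, minimum=0):
    """Count ways to give n identical cars to k relatives, each getting >= minimum."""
    return sum(
        1
        for shares in product(range(n + 1), repeat=k)   # shares[i] = cars for relative i
        if sum(shares) == n and min(shares) >= minimum
    )
```

With $n = 6$ cars and $k = 3$ relatives, the first case should give $\binom{8}{2} = 28$ and the second $\binom{5}{2} = 10$.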


Finding correspondences like this is a central part of enumerative combi- 
natorics, the branch of mathematics that deals with counting things. 


11.1.9 An elaborate counting problem 


Suppose you have the numbers $\{1,2,\ldots,2n\}$, and you want to count how
many sequences of $k$ of these numbers you can have that are (a) increasing
($a[i] < a[i+1]$ for all $i$), (b) decreasing ($a[i] > a[i+1]$ for all $i$), or (c) made
up only of even numbers.

This is the union of three sets A, B, and C, corresponding to the three 
cases. The first step is to count each set individually; then we can start 
thinking about applying inclusion-exclusion to get the size of the union. 

For $A$, any increasing sequence can be specified by choosing its elements
(the order is determined by the assumption it is increasing). So we have
$|A| = \binom{2n}{k}$.

For $B$, by symmetry we have $|B| = |A| = \binom{2n}{k}$.

For $C$, we are just looking at $n^k$ possible sequences, since there are $n$
even numbers we can put in each position.

Inclusion-exclusion says that $|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| -
|A \cap C| - |B \cap C| + |A \cap B \cap C|$. It’s not hard to see that $A \cap B = \emptyset$ when
$k$ is at least 2,⁵ so we can reduce this to $|A| + |B| + |C| - |A \cap C| - |B \cap C|$.
To count $A \cap C$, observe that we are now looking at increasing sequences
chosen from the $n$ possible even numbers; so there are exactly $\binom{n}{k}$ of them,
and similarly for $B \cap C$. Summing up gives a total of

\[
\binom{2n}{k} + \binom{2n}{k} + n^k - \binom{n}{k} - \binom{n}{k} = 2\left(\binom{2n}{k} - \binom{n}{k}\right) + n^k
\]


sequences satisfying at least one of the criteria. 


⁵ It’s even easier to assume that $A \cap B = \emptyset$ always, but for $k = 1$ any sequence is both
increasing and decreasing, since there are no pairs of adjacent elements in a 1-element
sequence to violate the property.

Note that we had to assume $k \ge 2$ to get $A \cap B = \emptyset$, so this formula might
require some adjustment for $k < 2$. In fact we can observe immediately that
the unique empty sequence for $k = 0$ fits in all of $A$, $B$, and $C$, so in this
case we get 1 winning sequence (which happens to be equal to the value in
the formula, because the omitted terms $-|A \cap B|$ and $+|A \cap B \cap C|$ cancel), and for $k = 1$ we get
$2n$ winning sequences (which is less than the value $3n$ given by the formula).

To test that the formula works for at least some larger values, let $n = 3$
and $k = 2$. Then the formula predicts $2\left(\binom{6}{2} - \binom{3}{2}\right) + 3^2 = 2(15 - 3) + 9 = 33$
total sequences.⁶ And here they are:


(1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)
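
A brute-force enumeration is a useful cross-check on the formula, including its failure at $k = 1$. The sketch below uses hypothetical helper names and simply filters all length-$k$ sequences over $\{1, \ldots, 2n\}$.

```python
from itertools import product
from math import comb

def winning_sequences(n, k):
    """Length-k sequences over {1..2n} that are increasing, decreasing, or all even."""
    def increasing(a):
        return all(x < y for x, y in zip(a, a[1:]))
    def decreasing(a):
        return all(x > y for x, y in zip(a, a[1:]))
    def all_even(a):
        return all(x % 2 == 0 for x in a)
    return [a for a in product(range(1, 2 * n + 1), repeat=k)
            if increasing(a) or decreasing(a) or all_even(a)]

def formula(n, k):
    """The inclusion-exclusion count 2 * (C(2n, k) - C(n, k)) + n^k."""
    return 2 * (comb(2 * n, k) - comb(n, k)) + n ** k
```

For $n = 3$ and $k = 2$ both counts come out to 33, while for $k = 1$ the enumeration gives $2n = 6$ against the formula's $3n = 9$.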


⁶ Without looking at the list, can you say which 3 of the $6^2 = 36$ possible length-2
sequences are missing?


11.1.10 Further reading


Rosen [Ros12] does basic counting in Chapter 6 and more advanced counting
(including solving recurrences and using generating functions) in Chapter 8.
Biggs [Big02] gives a basic introduction to counting in Chapters 6 and 10,
with more esoteric topics in Chapters 11 and 12. Graham et al. [GKP94]
have quite a bit on counting various things.

Combinatorics largely focuses on counting rather than efficient algorithms
for constructing particular combinatorial objects. The book Constructive
Combinatorics, by Stanton and White [SW86], remedies this omission, and
includes algorithms not only for enumerating all instances of various classes of
combinatorial objects but also for finding the $i$-th such instance in an appropriate
ordering without having to generate all previous instances (unranking)
and the inverse operation of finding the position of a particular object in an
appropriate ordering (ranking).


11.2 Binomial coefficients 


The binomial coefficient “n choose k”, written

\[
\binom{n}{k} = \frac{(n)_k}{k!} = \frac{n!}{k! \cdot (n-k)!}, \tag{11.2.1}
\]

counts the number of $k$-element subsets of an $n$-element set. (See §11.1.7.1
for how to derive (11.2.1).)
The name arises from the binomial theorem, which in the following
form was first proved by Isaac Newton:


Theorem 11.2.1 (Binomial theorem). For any $n \in \mathbb{R}$,

\[
(x+y)^n = \sum_{k=0}^{\infty} \binom{n}{k} x^k y^{n-k}, \tag{11.2.2}
\]

provided the sum converges.


A sufficient condition for the sum converging is $|x/y| < 1$. For the general
version of the theorem, $\binom{n}{k}$ is defined as $(n)_k / k!$, which works even if $n$ is
not a non-negative integer. The usual proof requires calculus.

In the common case when $n$ is a non-negative integer, we can limit
ourselves to letting $k$ range from 0 to $n$. The reason is that $\binom{n}{k} = 0$ when $n$
is a non-negative integer and $k > n$. This gives the more familiar version

\[
(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}. \tag{11.2.3}
\]

The connection between (11.2.3) and counting subsets is straightforward:
expanding $(x+y)^n$ using the distributive law gives $2^n$ terms, each of which
is a unique sequence of $n$ $x$’s and $y$’s. If we think of the $x$’s in each term as
labeling a subset of the $n$ positions in the term, the terms that get added
together to get $x^k y^{n-k}$ correspond one-to-one to subsets of size $k$. So there
are $\binom{n}{k}$ such terms, accounting for the coefficient on the right-hand side.


11.2.1 Recursive definition 


If we don’t like computing factorials, we can also compute binomial coefficients
recursively. This may actually be less efficient for large $n$ (we need to do $\Theta(n^2)$
additions instead of $O(n)$ multiplications and divisions), but the recurrence
gives some insight into the structure of binomial coefficients.

Base cases:


• If $k = 0$, then there is exactly one zero-element subset of our $n$-element
set (it’s the empty set), and we have $\binom{n}{0} = 1$.


• If $k > n$, then there are no $k$-element subsets, and we have $\forall k > n$:
$\binom{n}{k} = 0$.


Recursive step: We’ll use Pascal’s identity, which says that

\[
\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}.
\]


The easiest proof of this identity is combinatorial, which means that we 
will construct an explicit bijection between a set counted by the left-hand 
side and a set counted by the right-hand side. This is often one of the best 
ways of understanding simple binomial coefficient identities. 

On the left-hand side, we are counting all the $k$-element subsets of an
$n$-element set $S$. On the right-hand side, we are counting two different
collections of sets: the $(k-1)$-element and $k$-element subsets of an $(n-1)$-element
set. The trick is to recognize that we get an $(n-1)$-element set $S'$
from our original set by removing one of the elements $x$. When we do this,
we affect the subsets in one of two ways:

1. If the subset doesn’t contain $x$, it doesn’t change. So there is a one-to-one
correspondence (the identity function) between $k$-subsets of $S$
that don’t contain $x$ and $k$-subsets of $S'$. This bijection accounts for
the first term on the right-hand side.


2. If the subset does contain $x$, then we get a $(k-1)$-element subset
of $S'$ when we remove it. Since we can go back the other way by
reinserting $x$, we get a bijection between $k$-subsets of $S$ that contain $x$
and $(k-1)$-subsets of $S'$. This bijection accounts for the second term
on the right-hand side.


Adding the two cases together (using the sum rule), we conclude that 
the identity holds. 

Using the base case and Pascal’s identity, we can construct Pascal’s
triangle, a table of values of binomial coefficients:

1
1  1
1  2  1
1  3  3  1
1  4  6  4  1
1  5 10 10  5  1

Each row corresponds to increasing values of $n$, and each column to
increasing values of $k$, with $\binom{0}{0}$ in the upper left-hand corner. To compute each
entry, we add together the entry directly above it and the entry diagonally
above and to the left.
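
The table-filling procedure is short enough to write out. In this sketch (the function name `pascal_rows` is our own) each interior entry is the sum of the entry above and the entry above-left; the right edge is set to 1 directly, which is what the recurrence gives once entries with $k > n$ are treated as 0.

```python
from math import comb

def pascal_rows(num_rows):
    """Rows 0..num_rows-1 of Pascal's triangle, built only with additions."""
    rows = []
    for n in range(num_rows):
        row = [1] * (n + 1)                     # C(n,0) = C(n,n) = 1
        for k in range(1, n):
            # entry directly above plus entry above and to the left
            row[k] = rows[n - 1][k] + rows[n - 1][k - 1]
        rows.append(row)
    return rows

rows = pascal_rows(6)
```

The last row computed here is `[1, 5, 10, 10, 5, 1]`, matching the table.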


11.2.1.1 Pascal’s identity: algebraic proof 


Using the binomial theorem plus a little bit of algebra, we can prove Pascal’s
identity without using a combinatorial argument (this is not necessarily an
improvement). The additional fact we need is that if we have two equal
series

\[
\sum_{k=0}^{\infty} a_k x^k = \sum_{k=0}^{\infty} b_k x^k
\]

then $a_k = b_k$ for all $k$.⁷

⁷ This is a theorem in analysis if the series converges in some open interval
around 0, and follows from the ability to extract coefficients from $f(x) = \sum_{n=0}^{\infty} a_n x^n$ by
taking derivatives: $a_k = \frac{1}{k!} f^{(k)}(0)$, where $f^{(k)}$ is the $k$-th derivative of $f$. Alternatively, we can treat each
series as a formal power series, which we think of as an infinite sequence of coefficients
on which we can do the usual arithmetic operations without worrying about convergence.

Here’s the proof of Pascal’s identity:

\begin{align*}
\sum_{k=0}^{n} \binom{n}{k} x^k &= (1+x)^n \\
&= (1+x)(1+x)^{n-1} \\
&= (1+x)^{n-1} + x(1+x)^{n-1} \\
&= \sum_{k=0}^{n-1} \binom{n-1}{k} x^k + x \sum_{k=0}^{n-1} \binom{n-1}{k} x^k \\
&= \sum_{k=0}^{n-1} \binom{n-1}{k} x^k + \sum_{k=1}^{n} \binom{n-1}{k-1} x^k \\
&= \sum_{k=0}^{n} \left( \binom{n-1}{k} + \binom{n-1}{k-1} \right) x^k,
\end{align*}

where the last step uses $\binom{n-1}{n} = 0$ and the convention $\binom{n-1}{-1} = 0$,
and now we equate matching coefficients to get

\[
\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}
\]

as advertised.


11.2.2 Vandermonde’s identity


Vandermonde’s identity says that, provided $r$ does not exceed $m$ or $n$,

\[
\binom{m+n}{r} = \sum_{k=0}^{r} \binom{m}{r-k} \binom{n}{k}.
\]


11.2.2.1 Combinatorial proof


To pick $r$ elements of an $m + n$ element set, we have to pick some of them
from the first $m$ elements and some from the second $n$ elements. Suppose
we choose $k$ elements from the last $n$; there are $\binom{n}{k}$ different ways to do
this, and $\binom{m}{r-k}$ different ways to choose the remaining $r - k$ from the first $m$.
This gives (by the product rule) $\binom{m}{r-k}\binom{n}{k}$ ways to choose $r$ elements from the
whole set if we limit ourselves to choosing exactly $k$ from the last $n$. The
identity follows by summing over all possible values of $k$.


11.2.2.2 Algebraic proof 


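Here we use the fact that, for any sequences of coefficients $\{a_i\}$ and $\{b_i\}$
(taking $a_j = 0$ for $j > n$ and $b_j = 0$ for $j > m$),

\[
\left( \sum_{i=0}^{n} a_i x^i \right) \left( \sum_{i=0}^{m} b_i x^i \right) = \sum_{i=0}^{m+n} \left( \sum_{j=0}^{i} a_j b_{i-j} \right) x^i.
\]

So now consider

\begin{align*}
\sum_{r=0}^{m+n} \binom{m+n}{r} x^r &= (1+x)^{m+n} \\
&= (1+x)^n (1+x)^m \\
&= \left( \sum_{i=0}^{n} \binom{n}{i} x^i \right) \left( \sum_{j=0}^{m} \binom{m}{j} x^j \right) \\
&= \sum_{r=0}^{m+n} \left( \sum_{k=0}^{r} \binom{n}{k} \binom{m}{r-k} \right) x^r
\end{align*}

and equate terms with matching exponents.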
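
The algebraic argument amounts to multiplying two coefficient lists, which we can mimic directly. In this sketch the values $m = 4$ and $n = 6$ are arbitrary, `poly_mul` and `binomial_row` are hypothetical helper names, and `math.comb` returns 0 when the lower index exceeds the upper one.

```python
from math import comb

def poly_mul(a, b):
    """Multiply polynomials given as coefficient lists (index = exponent)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def binomial_row(n):
    """Coefficients of (1 + x)^n."""
    return [comb(n, k) for k in range(n + 1)]

m, n = 4, 6
product_coeffs = poly_mul(binomial_row(n), binomial_row(m))
vandermonde = [sum(comb(n, k) * comb(m, r - k) for k in range(r + 1))
               for r in range(m + n + 1)]
```

All three lists (the product, the row for $m + n$, and the Vandermonde sums) should coincide.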

Is this more enlightening than the combinatorial version? It depends on 
what kind of enlightenment you are looking for. In this case the combinatorial 
and algebraic arguments are counting essentially the same things in the same 
way, so it’s not clear what if any advantage either has over the other. But in 
many cases it’s easier to construct an algebraic argument than a combinatorial 
one, in the same way that it’s easier to do arithmetic using standard grade- 
school algorithms than by constructing explicit bijections. On the other 
hand, a combinatorial argument may let you carry other things you know 
about some structure besides just its size across the bijection, giving you 
more insight into the things you are counting. The best course is probably 
to have both techniques in your toolbox.


11.2.3 Sums of binomial coefficients


What is the sum of all binomial coefficients for a given $n$? We can show

\[
\sum_{k=0}^{n} \binom{n}{k} = 2^n
\]

combinatorially, by observing that adding up the number of subsets of an $n$-element
set of each size is the same as counting all subsets. Alternatively, apply the
binomial theorem to $(1+1)^n$.

Here’s another sum, with alternating sign. This is useful if you want to
know how the even-$k$ binomial coefficients compare to the odd-$k$ binomial
coefficients.

\[
\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0. \quad \text{(Assuming $n \ne 0$.)}
\]

Proof: $(1-1)^n = 0^n = 0$ when $n$ is nonzero. (When $n$ is zero, the $0^n$
part still works, since $0^0 = 1 = \binom{0}{0} (-1)^0$.)

By now it should be obvious that

\[
\sum_{k=0}^{n} 2^k \binom{n}{k} = 3^n.
\]

It’s not hard to construct more examples of this phenomenon.
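
All three sums are instances of $(1 + x)^n$ for a particular $x$, which suggests a one-line check. The helper name `weighted_sum` is our own.

```python
from math import comb

def weighted_sum(n, x):
    """Sum of x^k * C(n, k) for k = 0..n; the binomial theorem says (1 + x)^n."""
    return sum(x ** k * comb(n, k) for k in range(n + 1))
```

Taking $x = 1$, $x = -1$, and $x = 2$ recovers $2^n$, $0^n$, and $3^n$ respectively.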


11.2.4 The general inclusion-exclusion formula 


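We’ve previously seen that $|A \cup B| = |A| + |B| - |A \cap B|$. The generalization
of this fact from two to many sets is called the inclusion-exclusion formula
and says:


Theorem 11.2.2.

\[
\left| \bigcup_{i=1}^{n} A_i \right| = \sum_{S \subseteq \{1 \ldots n\},\, S \ne \emptyset} (-1)^{|S|+1} \left| \bigcap_{j \in S} A_j \right|. \tag{11.2.4}
\]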

This rather horrible expression means that to count the elements in the
union of $n$ sets $A_1$ through $A_n$, we start by adding up all the individual sets
$|A_1| + |A_2| + \ldots + |A_n|$, then subtract off the overcount from elements that
appear in two sets $-|A_1 \cap A_2| - |A_1 \cap A_3| - \ldots$, then add back the resulting
undercount from elements that appear in three sets, and so on.

Why does this work? Consider a single element $x$ that appears in $k$ of
the sets. We’ll count it as $+1$ in $\binom{k}{1}$ individual sets, as $-1$ in $\binom{k}{2}$ pairs, $+1$
in $\binom{k}{3}$ triples, and so on, adding up to

\[
\sum_{i=1}^{k} (-1)^{i+1} \binom{k}{i} = -\left( \sum_{i=1}^{k} (-1)^i \binom{k}{i} \right) = -\left( \sum_{i=0}^{k} (-1)^i \binom{k}{i} - 1 \right) = -(0 - 1) = 1.
\]
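
Formula (11.2.4) can be evaluated literally by looping over all nonempty collections of the sets. The sketch below does that; the function name `union_size` and the example sets (multiples of 2, 3, and 5 below 100) are assumptions chosen for illustration.

```python
from itertools import combinations

def union_size(sets):
    """Size of the union of a list of finite sets via inclusion-exclusion."""
    total = 0
    for size in range(1, len(sets) + 1):
        sign = (-1) ** (size + 1)
        for chosen in combinations(sets, size):       # every nonempty index set S
            total += sign * len(set.intersection(*chosen))
    return total

# Hypothetical example sets: multiples of 2, 3, and 5 below 100.
sets = [set(range(0, 100, d)) for d in (2, 3, 5)]
```

The result should agree with the size of the union computed directly. Note that the loop visits $2^n - 1$ subsets, so this is only sensible for a handful of sets.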


11.2.5 Negative binomial coefficients 


Though it doesn’t make sense to talk about the number of $k$-subsets of
a $(-1)$-element set, the binomial coefficient $\binom{n}{k}$ has a meaningful value for
negative $n$, which works in the binomial theorem. We’ll use the lower-factorial
version of the definition:

\[
\binom{-n}{k} = \frac{(-n)_k}{k!} = \frac{1}{k!} \prod_{i=-n-k+1}^{-n} i.
\]

Note we still demand that $k \in \mathbb{N}$; we are only allowed to do funny things
with the upper index $n$.
So for example:

\[
\binom{-1}{k} = \frac{(-1)_k}{k!} = \frac{1}{k!} \prod_{i=-k}^{-1} i = \frac{(-1)^k \prod_{i=1}^{k} i}{\prod_{i=1}^{k} i} = (-1)^k.
\]


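An application of this fact is that

\[
\frac{1}{1-z} = (1-z)^{-1} = \sum_{n=0}^{\infty} \binom{-1}{n} 1^{-1-n} (-z)^n = \sum_{n=0}^{\infty} (-1)^n (-z)^n = \sum_{n=0}^{\infty} z^n.
\]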

In computing this sum, we had to be careful which of $1$ and $-z$ got the
$n$ exponent and which got $-1-n$. If we do it the other way, we get

\[
\frac{1}{1-z} = (1-z)^{-1} = \sum_{n=0}^{\infty} \binom{-1}{n} 1^n (-z)^{-1-n} = -\frac{1}{z} \sum_{n=0}^{\infty} \frac{1}{z^n}.
\]

This turns out to actually be correct (when $|z| > 1$, so that the series converges), since applying the geometric series
formula turns the last line into

\[
-\frac{1}{z} \cdot \frac{1}{1 - 1/z} = -\frac{1}{z-1} = \frac{1}{1-z},
\]

but it’s a lot less useful.

What happens for a larger upper index? One way to think about $(-n)_k$ is
that we are really computing $(n+k-1)_k$ and then negating all the factors,
which is equivalent to multiplying the whole expression by $(-1)^k$. So this
gives us the identity

\[
\binom{-n}{k} = \frac{(-n)_k}{k!} = (-1)^k \frac{(n+k-1)_k}{k!} = (-1)^k \binom{n+k-1}{k}.
\]
So, for example, 
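
One instance that can be checked by hand, chosen here purely as an illustration, takes the upper index to be $-2$. The identity gives the coefficients, and plugging them into the binomial theorem (with the exponents assigned as in the $1/(1-z)$ computation) gives a familiar series:

\[
\binom{-2}{k} = (-1)^k \binom{k+1}{k} = (-1)^k (k+1),
\qquad\text{so}\qquad
\frac{1}{(1-z)^2} = \sum_{k=0}^{\infty} \binom{-2}{k} (-z)^k = \sum_{k=0}^{\infty} (k+1) z^k.
\]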


These facts will be useful when we look at generating functions in §11.3. 
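
The lower-factorial definition is easy to compute exactly with rational arithmetic. The sketch below uses a hypothetical helper `gbinom` built on Python's `fractions.Fraction`, and can be used to spot-check the identities for negative upper indices.

```python
from fractions import Fraction
from math import comb, factorial

def gbinom(n, k):
    """Generalized binomial coefficient (n)_k / k! for rational n and natural k."""
    falling = Fraction(1)
    for i in range(k):
        falling *= Fraction(n) - i        # n * (n-1) * ... * (n-k+1)
    return falling / factorial(k)
```

For instance, `gbinom(-1, k)` alternates between 1 and −1, and `gbinom(n, k)` agrees with the ordinary binomial coefficient when $n$ is a non-negative integer.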


⁸ Or just got back from reading Appendix H.


11.2.6 Fractional binomial coefficients


Yes, we can do fractional binomial coefficients, too. Exercise: Find the value
of

\[
\binom{1/2}{n} = \frac{(1/2)_n}{n!}.
\]

Like negative binomial coefficients, these don’t have an obvious combinatorial
interpretation, but can be handy for computing power series of
fractional binomial powers like $\sqrt{1+z} = (1+z)^{1/2}$.
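
One way to gain confidence in these coefficients, without yet finding a closed form, is to compute the first few exactly and check that the resulting truncated series squares to $1 + z$. The helper name `half_binom` and the truncation length `N = 8` are arbitrary choices for this sketch.

```python
from fractions import Fraction
from math import factorial

def half_binom(n):
    """(1/2)_n / n!, computed exactly with rationals."""
    falling = Fraction(1)
    for i in range(n):
        falling *= Fraction(1, 2) - i     # (1/2) * (1/2 - 1) * ... * (1/2 - n + 1)
    return falling / factorial(n)

N = 8
coeffs = [half_binom(n) for n in range(N)]      # series for (1 + z)^(1/2)
# Square the truncated series; the terms below z^N should reproduce 1 + z.
square = [sum(coeffs[i] * coeffs[r - i] for i in range(r + 1)) for r in range(N)]
```

The first few coefficients are $1, \frac{1}{2}, -\frac{1}{8}, \frac{1}{16}$, and every coefficient of the square past $z^1$ cancels to 0.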


11.2.7 Further reading 


Graham et al. [GKP94] §5.1–5.3 is an excellent source for information about
all sorts of facts about binomial coefficients.


11.3 Generating functions


We’ve seen that in some cases we can use the binomial theorem to express
infinite power series like $\sum_{n=0}^{\infty} z^n$ as compact expressions like $\frac{1}{1-z}$. The compact
representation is called a generating function of the series, and manipulating
generating functions can be an efficient tool to keep track of series
whose coefficients represent sequences of counts of combinatorial objects of
different sizes.


11.3.1 Basics 


A generating function represents objects of weight $n$ with $z^n$, and adds all
the objects you have up to get a sum $a_0 z^0 + a_1 z^1 + a_2 z^2 + \dots$, where each $a_n$
counts the number of different objects of weight $n$. If you are very lucky (or
constructed your set of objects by combining simpler sets of objects in certain
straightforward ways) there will be some compact expression that expands
to this horrible sum but is easier to write down. Such compact expressions
are called generating functions, and manipulating them algebraically gives
an alternative to actually knowing how to count (Chapter 11).


11.3.1.1 A simple example 


We are given some initial prefixes for words: qu, s, and t; some vowels to 
put in the middle: a, i, and oi; and some suffixes: d, ff, and ck, and we 
want to calculate the number of words we can build of each length.

One way is to generate all 27 words⁹ and sort them by length:


sad sid tad tid 

quad quid sack saff sick siff soid tack taff tick tiff toid 
quack quaff quick quiff quoid soick soiff toick toiff 
quoick quoiff 


This gives us 4 length-3 words, 12 length-4 words, 9 length-5 words, and 
2 length-6 words. This is probably best done using a computer, and becomes 
expensive if we start looking at much larger lists. 

An alternative is to solve the problem by judicious use of algebra. Pretend 
that each of our letters is actually a variable, and that when we concatenate 
qu, oi, and ck to make quoick, we are really multiplying the variables 
using our usual notation. Then we can express all 27 words as the product 
(qu+s+t)(a+i+oi)(d+ff+ ck). But we don’t care about the exact set 
of words, we just want to know how many we get of each length. 

So now we do the magic trick: we replace every variable we've got with a
single variable $z$. For example, this turns quoick into $zzzzzz = z^6$, so we
can still find the length of a word by reading off the exponent on $z$. But we
can also do this before we multiply everything out, getting

\begin{align*}
(zz + z + z)(z + z + zz)(z + zz + zz) &= (2z + z^2)(2z + z^2)(z + 2z^2) \\
&= z^3 (2 + z)^2 (1 + 2z) \\
&= z^3 (4 + 4z + z^2)(1 + 2z) \\
&= z^3 (4 + 12z + 9z^2 + 2z^3) \\
&= 4z^3 + 12z^4 + 9z^5 + 2z^6.
\end{align*}


We can now read off the number of words of each length directly off the 
coefficients of this polynomial. 
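The two approaches can be compared mechanically. The following Python sketch is an illustration only, not part of the original notes; the helper names `poly_from_parts` and `poly_mul` are made up for this example. It lists all 27 words, then repeats the count by multiplying three small polynomials stored as coefficient lists (index = exponent of $z$).

```python
from collections import Counter
from itertools import product

prefixes = ["qu", "s", "t"]
vowels = ["a", "i", "oi"]
suffixes = ["d", "ff", "ck"]

def brute_force_counts():
    """Count words by length by listing every prefix + vowel + suffix."""
    return Counter(len(p + v + s)
                   for p, v, s in product(prefixes, vowels, suffixes))

def poly_from_parts(parts):
    """Coefficient list c where c[k] is the number of parts of length k."""
    coeffs = [0] * (max(len(part) for part in parts) + 1)
    for part in parts:
        coeffs[len(part)] += 1
    return coeffs

def poly_mul(f, g):
    """Multiply two polynomials given as coefficient lists."""
    h = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] += a * b
    return h

def algebra_counts():
    """Replace every letter by z and multiply the three factors."""
    result = [1]
    for parts in (prefixes, vowels, suffixes):
        result = poly_mul(result, poly_from_parts(parts))
    return result

print(algebra_counts())  # [0, 0, 0, 4, 12, 9, 2]
```

The polynomial version never builds the words themselves, which is why it stays cheap when the lists of parts grow.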


11.3.1.2 Why this works


In general, what we do is replace any object of weight 1 with $z$. If we have
an object with weight $n$, we think of it as $n$ weight-1 objects stuck together,
i.e., $z^n$. Disjoint unions are done using addition as in simple counting: $z + z^2$
represents the choice between a weight-1 object and a weight-2 object (which
might have been built out of 2 weight-1 objects), while $12z^4$ represents a
choice between 12 different weight-4 objects. The trick is that when we
multiply two expressions like this, whenever two values $z^k$ and $z^l$ collide, the
exponents add to give a new value $z^{k+l}$ representing a new object with total
weight $k + l$, and if we have something more complex like $(nz^k)(mz^l)$, then
the coefficients multiply to give $(nm)z^{k+l}$, representing $nm$ different
weight-$(k + l)$ objects.

⁹ We are using word in the combinatorial sense of a finite sequence of letters (possibly
even the empty sequence) and not the usual sense of a finite, nonempty sequence of letters
that actually make sense.

For example, suppose we want to count the number of robots we can
build given 5 choices of heads, each of weight 2, and 6 choices of bodies, each
of weight 5. We represent the heads by $5z^2$ and the bodies by $6z^5$. When
we multiply these expressions together, the coefficients multiply (which we
want, by the product rule) and the exponents add: we get $5z^2 \cdot 6z^5 = 30z^7$,
or 30 robots of weight 7 each.

The real power comes in when we consider objects of different weights. If
we add to our 5 weight-2 robot heads two extra-fancy heads of weight 3, and
compensate on the body side with three new lightweight weight-4 bodies,
our new expression is $(5z^2 + 2z^3)(3z^4 + 6z^5) = 15z^6 + 36z^7 + 12z^8$, giving
a possible 15 weight-6 robots, 36 weight-7 robots, and 12 weight-8 robots.
The rules for multiplying polynomials automatically tally up all the different
cases for us.

This trick even works for infinitely-long polynomials that represent infinite 
series (such “polynomials” are called formal power series). Even though 
there might be infinitely many ways to pick three natural numbers, there 
are only finitely many ways to pick three natural numbers whose sum is 
37. By computing an appropriate formal power series and extracting the 
coefficient from the $z^{37}$ term, we can figure out exactly how many ways
there are. This works best, of course, when we don’t have to haul around 
an entire infinite series, but can instead represent it by some more compact 
function whose expansion gives the desired series. Such a function is called 
a generating function, and manipulating generating functions can be a 
powerful alternative to creativity in making combinatorial arguments. 
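As a small illustration of extracting a coefficient from a product of formal power series, the sketch below counts ordered triples of natural numbers that sum to 37. It assumes the series $1 + z + z^2 + \cdots$ for "one natural number", truncates everything after $z^{37}$, and uses a hypothetical helper `series_mul`; the direct enumeration is only there as a cross-check.

```python
def series_mul(f, g, n_terms):
    """Multiply two power series given as coefficient lists,
    keeping only the coefficients of z^0 .. z^(n_terms - 1)."""
    h = [0] * n_terms
    for i, a in enumerate(f[:n_terms]):
        for j, b in enumerate(g[:n_terms - i]):
            h[i + j] += a * b
    return h

N = 38                     # keep z^0 .. z^37
one_natural = [1] * N      # one object of each weight 0, 1, 2, ...
pairs = series_mul(one_natural, one_natural, N)
triples = series_mul(pairs, one_natural, N)

# Direct count of (a, b, c) with a + b + c = 37; c is determined by a and b.
brute = sum(1 for a in range(38) for b in range(38 - a))

print(triples[37], brute)  # 741 741
```

Truncation is safe here because terms above $z^{37}$ can never contribute to the coefficient of $z^{37}$.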


11.3.1.3 Formal definition


Given a sequence $a_0, a_1, a_2, \dots$, its generating function $F(z)$ is given by
the sum

\[ F(z) = \sum_{i=0}^{\infty} a_i z^i. \]

A sum in this form is called a formal power series. It is "formal" in
the sense that we don't necessarily plan to actually compute the sum, and
are instead using the string of $z^i$ terms as a long rack to store coefficients on.

In some cases, the sum has a more compact representation. For example,
we have

\[ \frac{1}{1-z} = \sum_{i=0}^{\infty} z^i, \]

so $1/(1-z)$ is the generating function for the sequence $1, 1, 1, \dots$. This
may let us manipulate this sequence conveniently by manipulating the
generating function.

Here's a simple case. If $F(z)$ generates some sequence $a_i$, what
sequence $b_i$ does $F(2z)$ generate? The $i$-th term in the expansion of $F(2z)$
will be $a_i(2z)^i = a_i 2^i z^i$, so we have $b_i = 2^i a_i$. This means that the
sequence $1, 2, 4, 8, 16, \dots$ has generating function $1/(1-2z)$. In general, if $F(z)$
represents $a_i$, then $F(cz)$ represents $c^i a_i$.

What else can we do to $F$? One useful operation is to take its derivative
with respect to $z$. We then have

\[ \frac{d}{dz} F(z) = \sum_{i=0}^{\infty} a_i \frac{d}{dz} z^i = \sum_{i=0}^{\infty} a_i i z^{i-1}. \]

This almost gets us the representation for the series $i a_i$, but the exponents
on the $z$'s are off by one. But that's easily fixed:

\[ z \frac{d}{dz} F(z) = \sum_{i=0}^{\infty} a_i i z^i. \]

So the sequence $0, 1, 2, 3, 4, \dots$ has generating function

\[ z \frac{d}{dz} \frac{1}{1-z} = \frac{z}{(1-z)^2}, \]

and the sequence of squares $0, 1, 4, 9, 16, \dots$ has generating function

\[ z \frac{d}{dz} \frac{z}{(1-z)^2} = \frac{z}{(1-z)^2} + \frac{2z^2}{(1-z)^3}. \]


As you can see, some generating functions are prettier than others. 

(We can also use integration to divide each term by i, but the details are 
messier. ) 
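At the level of coefficient lists, both operations are one-liners. The sketch below is illustrative only (the function names `scale`, `z_ddz`, and `expand` are assumptions of this example, not from the notes): `expand` computes the first few coefficients of a rational function by long division, and the assertions compare those with the results of acting on the coefficients directly.

```python
def scale(coeffs, c):
    """F(z) -> F(cz): multiplies a_i by c**i."""
    return [a * c**i for i, a in enumerate(coeffs)]

def z_ddz(coeffs):
    """F(z) -> z * dF/dz: multiplies a_i by i."""
    return [i * a for i, a in enumerate(coeffs)]

def expand(num, den, n_terms):
    """First n_terms coefficients of num(z)/den(z), assuming den[0] == 1."""
    assert den[0] == 1
    out = []
    for k in range(n_terms):
        c = num[k] if k < len(num) else 0
        for j in range(1, min(k, len(den) - 1) + 1):
            c -= den[j] * out[k - j]
        out.append(c)
    return out

ones = expand([1], [1, -1], 6)                         # 1/(1-z)
assert scale(ones, 2) == expand([1], [1, -2], 6)       # 1/(1-2z)
assert z_ddz(ones) == expand([0, 1], [1, -2, 1], 6)    # z/(1-z)^2
# z/(1-z)^2 + 2z^2/(1-z)^3 = (z + z^2)/(1-z)^3
assert z_ddz(z_ddz(ones)) == expand([0, 1, 1], [1, -3, 3, -1], 6)

print(z_ddz(z_ddz(ones)))  # [0, 1, 4, 9, 16, 25]
```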

Another way to get the sequence $0, 1, 2, 3, 4, \dots$ is to observe that it
satisfies the recurrence:


• $a_0 = 0$.
• $a_{n+1} = a_n + 1$ ($\forall n \in \mathbb{N}$).


A standard trick in this case is to multiply each of the $\forall n$ bits by $z^n$, sum
over all $n$, and see what happens. This gives $\sum a_{n+1} z^n = \sum a_n z^n + \sum z^n =
\sum a_n z^n + 1/(1-z)$. The first term on the right-hand side is the generating
function for $a_n$, which we can call $F(z)$ so we don't have to keep writing it
out. The second term is just the generating function for $1, 1, 1, 1, 1, \dots$. But
what about the left-hand side? This is almost the same as $F(z)$, except the
coefficients don't match up with the exponents. We can fix this by dividing
$F(z)$ by $z$, after carefully subtracting off the $a_0$ term:

\begin{align*}
(F(z) - a_0)/z &= \left( \sum_{n=0}^{\infty} a_n z^n - a_0 \right) / z \\
&= \left( \sum_{n=1}^{\infty} a_n z^n \right) / z \\
&= \sum_{n=1}^{\infty} a_n z^{n-1} \\
&= \sum_{n=0}^{\infty} a_{n+1} z^n.
\end{align*}

So this gives the equation $(F(z) - a_0)/z = F(z) + 1/(1-z)$. Since $a_0 = 0$,
we can rewrite this as $F(z)/z = F(z) + 1/(1-z)$. A little bit of algebra
turns this into $F(z) - zF(z) = z/(1-z)$ or $F(z) = z/(1-z)^2$.

Yet another way to get this sequence is to construct a collection of objects
with a simple structure such that there are exactly $n$ objects with weight $n$.
One way to do this is to consider strings of the form $a^+b^*$ where we have
at least one a followed by zero or more b's. This gives $n$ strings of length
$n$, because we get one string for each of the 1 through $n$ a's we can put in
(an example would be abb, aab, and aaa for $n = 3$). We can compute the
generating function for this set because to generate each string we must pick
in order:


• One initial a. Generating function = $z$.
• Zero or more a's. Generating function = $1/(1-z)$.
• Zero or more b's. Generating function = $1/(1-z)$.


Taking the product of these gives $z/(1-z)^2$, as before.
This trick is useful in general; if you are given a generating function $F(z)$
for $a_n$, but want a generating function for $b_n = \sum_{k \le n} a_k$, allow yourself to
pad each weight-$k$ object out to weight $n$ in exactly one way using $n - k$
junk objects, i.e. multiply $F(z)$ by $1/(1-z)$.
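The prefix-sum effect of multiplying by $1/(1-z)$ is easy to check on truncated coefficient lists. This is a minimal sketch; `series_mul` is a hypothetical helper written for the example, and `itertools.accumulate` serves as the independent running-sum computation.

```python
from itertools import accumulate

def series_mul(f, g, n_terms):
    """Multiply two power series given as coefficient lists,
    keeping only the coefficients of z^0 .. z^(n_terms - 1)."""
    h = [0] * n_terms
    for i, a in enumerate(f[:n_terms]):
        for j, b in enumerate(g[:n_terms - i]):
            h[i + j] += a * b
    return h

N = 8
geometric = [1] * N                  # 1/(1-z)
a = [0] + [1] * (N - 1)              # z/(1-z): the sequence 0, 1, 1, 1, ...
b = series_mul(a, geometric, N)      # z/(1-z)^2
print(b)                             # [0, 1, 2, 3, 4, 5, 6, 7]
assert b == list(accumulate(a))

squares = [n * n for n in range(N)]
assert series_mul(squares, geometric, N) == list(accumulate(squares))
```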


11.3.2 Some standard generating functions 


Here is a table of some of the most useful generating functions. 
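The entries below collect identities that are derived elsewhere in this chapter; this is a summary assembled from those derivations, so a fuller table may list more rows. The last line follows from the identity for binomial coefficients with a negative upper index.

```latex
\begin{align*}
\frac{1}{1-z}     &= \sum_{i=0}^{\infty} z^i \\
\frac{z}{(1-z)^2} &= \sum_{i=0}^{\infty} i z^i \\
(1+z)^n           &= \sum_{i=0}^{\infty} \binom{n}{i} z^i \\
\frac{1}{(1-z)^n} &= \sum_{i=0}^{\infty} \binom{n+i-1}{i} z^i
\end{align*}
```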


Of these, the first is the most useful to remember (it’s also handy for 
remembering how to sum geometric series). All of these equations can be 
proven using the binomial theorem. 


11.3.3 More operations on formal power series and generating functions


Let $F(z) = \sum_i a_i z^i$ and $G(z) = \sum_i b_i z^i$. Then their sum $F(z) + G(z) =
\sum_i (a_i + b_i) z^i$ is the generating function for the sequence $(a_i + b_i)$. What is
their product $F(z)G(z)$?

To compute the $i$-th term of $F(z)G(z)$, we have to sum over all pairs of
terms, one from $F$ and one from $G$, that produce a $z^i$ factor. Such pairs of
terms are precisely those that have exponents that sum to $i$. So we have

\[ F(z)G(z) = \sum_{i=0}^{\infty} \left( \sum_{j=0}^{i} a_j b_{i-j} \right) z^i. \]


As we've seen, this equation has a natural combinatorial interpretation.
If we interpret the coefficient $a_i$ on the $i$-th term of $F(z)$ as counting the
number of "a-things" of weight $i$, and the coefficient $b_i$ as the number of
"b-things" of weight $i$, then the $i$-th coefficient of $F(z)G(z)$ counts the number
of ways to make a combined thing of total weight $i$ by gluing together an
a-thing and a b-thing.

As a special case, if $F(z) = G(z)$, then the $i$-th coefficient of $F(z)G(z) =
F^2(z)$ counts how many ways to make a thing of total weight $i$ using two
"a-things", and $F^n(z)$ counts how many ways (for each $i$) to make a thing of
total weight $i$ using $n$ "a-things". This gives us an easy combinatorial proof
of a special case of the binomial theorem:

\[ (1+z)^n = \sum_{i=0}^{\infty} \binom{n}{i} z^i. \]

Think of the left-hand side as the generating function $F(z) = 1 + z$ raised
to the $n$-th power. The function $F$ by itself says that you have a choice
between one weight-0 object or one weight-1 object. On the right-hand side
the $i$-th coefficient counts how many ways you can put together a total of $i$
weight-1 objects given $n$ to choose from, so it's $\binom{n}{i}$.


11.3.4 Counting with generating functions 


The product formula above suggests that generating functions can be used to 
count combinatorial objects that are built up out of other objects, where our 
goal is to count the number of objects of each possible non-negative integer 
“weight” (we put “weight” in scare quotes because we can make the “weight” 
be any property of the object we like, as long as it’s a non-negative integer—a 
typical choice might be the size of a set, as in the binomial theorem example 
above). There are five basic operations involved in this process; we’ve seen 
two of them already, but will restate them here with the others. 

Throughout this section, we assume that F(z) is the generating function 
counting objects in some set A and G(z) the generating function counting 
objects in some set B. 


11.3.4.1 Disjoint union 


Suppose $C = A \cup B$ and $A$ and $B$ are disjoint. Then the generating function
for objects in $C$ is $F(z) + G(z)$.

Example: Suppose that A is the set of all strings of zero or more letters x,
where the weight of a string is just its length. Then $F(z) = 1/(1-z)$, since
there is exactly one string of each length and the coefficient $a_i$ on each $z^i$ is
always 1. Suppose that B is the set of all strings of zero or more letters y
and/or z, so that $G(z) = 1/(1-2z)$ (since there are now $2^i$ choices of length-$i$
strings). The set C of strings that are either (a) all x's or (b) made up of y's,
z's, or both, has generating function $F(z) + G(z) = 1/(1-z) + 1/(1-2z)$.
(Strictly speaking, the empty string belongs to both A and B, so this sum
counts it twice; removing it from one of the two sets makes the union disjoint
and subtracts 1 from the generating function.)


11.3.4.2 Cartesian product


Now let $C = A \times B$, and let the weight of a pair $(a, b) \in C$ be the sum of
the weights of $a$ and $b$. Then the generating function for objects in $C$ is
$F(z)G(z)$.

Example: Let A be the all-x strings and B be the strings of y's and/or z's, as in the
previous example. Let C be the set of all strings that consist of zero or more
x's followed by zero or more y's and/or z's. Then the generating function
for C is $F(z)G(z) = \frac{1}{(1-z)(1-2z)}$.
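The product rule for this example can be checked by brute force. The sketch below is illustrative, with made-up helper names: it multiplies the truncated series for $1/(1-z)$ and $1/(1-2z)$ and compares the result with a direct count of strings over {x, y, z} that consist of x's followed by y's and/or z's.

```python
import re
from itertools import product

def series_mul(f, g, n_terms):
    """Multiply two power series given as coefficient lists,
    keeping only the coefficients of z^0 .. z^(n_terms - 1)."""
    h = [0] * n_terms
    for i, a in enumerate(f[:n_terms]):
        for j, b in enumerate(g[:n_terms - i]):
            h[i + j] += a * b
    return h

def brute_force(n):
    """Count length-n strings of x's followed by y's and/or z's."""
    shape = re.compile(r"x*[yz]*")
    return sum(1 for letters in product("xyz", repeat=n)
               if shape.fullmatch("".join(letters)))

N = 8
F = [1] * N                      # all-x strings: 1/(1-z)
G = [2**i for i in range(N)]     # strings of y's and z's: 1/(1-2z)
C = series_mul(F, G, N)
print(C)                         # [1, 3, 7, 15, 31, 63, 127, 255]
assert C == [brute_force(n) for n in range(N)]
```

Each string splits into its x-part and its y/z-part in exactly one way, which is what lets the pair $(a, b)$ stand for the concatenated string.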


11.3.4.3 Repetition 


Now let C consist of all finite sequences of objects in A, with the weight of
each sequence equal to the sum of the weights of its elements (0 for an empty
sequence). Let $H(z)$ be the generating function for C. From the preceding
rules we have

\[ H = 1 + F + F^2 + F^3 + \cdots = \frac{1}{1-F}. \]

This works best when $F(0) = 0$; otherwise we get infinitely many weight-0
sequences. It's also worth noting that this is just a special case of substitution
(see below), where our "outer" generating function is $1/(1-z)$.


Example: (0|11)* Let $A = \{0, 11\}$, and let C be the set of all sequences
of zeros and ones where ones occur only in even-length runs. Then the
generating function for A is $z + z^2$ and the generating function for C is
$1/(1 - z - z^2)$. We can extract exact coefficients from this generating function
using the techniques below.
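One way to see the repetition rule in action is to compute the coefficients of $1/(1-F)$ from the relation $H = 1 + FH$, which says that a sequence is either empty or a first object followed by a sequence. The sketch below assumes $F$ has no constant term and uses illustrative helper names; the regular-expression count is an independent check.

```python
import re
from itertools import product

def repetition(f, n_terms):
    """Coefficients of H = 1/(1 - F), assuming f[0] == 0.
    From H = 1 + F*H: h_0 = 1 and h_n = sum_k f_k * h_(n-k) for n >= 1."""
    assert f[0] == 0
    h = [1] + [0] * (n_terms - 1)
    for n in range(1, n_terms):
        h[n] = sum(f[k] * h[n - k]
                   for k in range(1, min(n, len(f) - 1) + 1))
    return h

def brute_force(n):
    """Count bit strings of length n built by repeating 0 and 11."""
    shape = re.compile(r"(0|11)*")
    return sum(1 for bits in product("01", repeat=n)
               if shape.fullmatch("".join(bits)))

A = [0, 1, 1]                 # z + z^2: one object of weight 1, one of weight 2
C = repetition(A, 10)
print(C)                      # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
assert C == [brute_force(n) for n in range(10)]
```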


Example: sequences of positive integers Suppose we want to know
how many different ways there are to generate a particular integer as a sum
of positive integers. For example, we can express 4 as 4, 3+1, 2+2, 2+1+1,
1+1+1+1, 1+1+2, 1+2+1, or 1+3, giving 8 different ways.

We can solve this problem using the repetition rule. Let $F = z/(1-z)$
generate all the positive integers. Then

\begin{align*}
H &= \frac{1}{1-F} \\
&= \frac{1}{1 - \frac{z}{1-z}} \\
&= \frac{1-z}{(1-z) - z} \\
&= \frac{1-z}{1-2z}.
\end{align*}

We can get exact coefficients by observing that

\begin{align*}
\frac{1-z}{1-2z} &= \frac{1}{1-2z} - \frac{z}{1-2z} \\
&= \sum_{n=0}^{\infty} 2^n z^n - \sum_{n=0}^{\infty} 2^n z^{n+1} \\
&= \sum_{n=0}^{\infty} 2^n z^n - \sum_{n=1}^{\infty} 2^{n-1} z^n \\
&= 1 + \sum_{n=1}^{\infty} (2^n - 2^{n-1}) z^n \\
&= 1 + \sum_{n=1}^{\infty} 2^{n-1} z^n.
\end{align*}

This means that there is 1 way to express 0 (the empty sum), and $2^{n-1}$
ways to express any larger value $n$ (e.g. $2^{4-1} = 8$ ways to express 4).

Once we know what the right answer is, it's not terribly hard to come
up with a combinatorial explanation. The quantity $2^{n-1}$ counts the number
of subsets of an $(n-1)$-element set. So imagine that we have $n - 1$ places
and we mark some subset of them, plus add an extra mark at the end; this
might give us a pattern like XX-X. Now for each sequence of places ending
with a mark we replace it with the number of places (e.g. XX-X = 1, 1, 2,
X--X-X---X = 1, 3, 2, 4). Then the sum of the numbers we get is equal to $n$,
because it's just counting the total length of the sequence by dividing it up
at the marks and then adding the pieces back together. The value 0 doesn't
fit this pattern (we can't put in the extra mark without getting a sequence
of length 1), so we have 0 as a special case again.
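Both the count and the marking argument can be tried out directly. In the sketch below (function names are illustrative), `compositions` lists every ordered sum of positive integers equal to $n$, and `pattern_to_composition` performs the "cut after each mark" step on patterns like the ones above.

```python
def compositions(n):
    """Yield every ordered list of positive integers summing to n."""
    if n == 0:
        yield []
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield [first] + rest

def pattern_to_composition(pattern):
    """Cut a mark pattern such as 'XX-X' after each X
    and return the lengths of the pieces."""
    parts, run = [], 0
    for ch in pattern:
        run += 1
        if ch == "X":
            parts.append(run)
            run = 0
    return parts

counts = [sum(1 for _ in compositions(n)) for n in range(8)]
print(counts)                                 # [1, 1, 2, 4, 8, 16, 32, 64]
print(pattern_to_composition("XX-X"))         # [1, 1, 2]
print(pattern_to_composition("X--X-X---X"))   # [1, 3, 2, 4]
```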

If we are very clever, we might come up with this combinatorial expla- 
nation from the beginning. But the generating function approach saves us 
from having to be clever.


11.3.4.4 Pointing


This operation is a little tricky to describe. Suppose that we can think of
each weight-$k$ object in A as consisting of $k$ items, and that we want to count
not only how many weight-$k$ objects there are, but how many ways we can
produce a weight-$k$ object where one of its $k$ items has a special mark on
it. Since there are $k$ different items to choose for each weight-$k$ object, we
are effectively multiplying the count of weight-$k$ objects by $k$. In generating
function terms, we have

\[ H(z) = z \frac{d}{dz} F(z). \]

Repeating this operation allows us to mark more items (with some items
possibly getting more than one mark). If we want to mark $n$ distinct items
in each object (with distinguishable marks), we can compute

\[ H(z) = z^n \frac{d^n}{dz^n} F(z), \]

where the repeated derivative turns each term $a_i z^i$ into $a_i i(i-1)(i-2) \cdots (i-n+1) z^{i-n}$
and the $z^n$ factor fixes up the exponents. To make the
marks indistinguishable (i.e., we don't care what order the values are marked
in), divide by $n!$ to turn the extra factor into $\binom{i}{n}$.

(If you are not sure how to take a derivative, look at §H.2.) 

Example: Count the number of finite sequences of zeros and ones where
exactly two digits are underlined. The generating function for $\{0, 1\}$ is $2z$,
so the generating function for sequences of zeros and ones is $F = 1/(1-2z)$
by the repetition rule. To mark two digits with indistinguishable marks, we
need to compute

\[ \frac{1}{2} z^2 \frac{d^2}{dz^2} \frac{1}{1-2z} = \frac{1}{2} z^2 \frac{d}{dz} \frac{2}{(1-2z)^2} = \frac{1}{2} z^2 \frac{8}{(1-2z)^3} = \frac{4z^2}{(1-2z)^3}. \]
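As a sanity check on this example, the coefficient of $z^n$ in $4z^2/(1-2z)^3$ should be $\binom{n}{2} 2^n$: choose the bits, then choose which two positions are underlined. The sketch below is illustrative only; `expand` is a hypothetical long-division helper, and the denominator is $(1-2z)^3 = 1 - 6z + 12z^2 - 8z^3$ written as a coefficient list.

```python
from itertools import combinations, product
from math import comb

def expand(num, den, n_terms):
    """First n_terms coefficients of num(z)/den(z), assuming den[0] == 1."""
    assert den[0] == 1
    out = []
    for k in range(n_terms):
        c = num[k] if k < len(num) else 0
        for j in range(1, min(k, len(den) - 1) + 1):
            c -= den[j] * out[k - j]
        out.append(c)
    return out

def brute_force(n):
    """Count (bit string of length n, unordered pair of underlined positions)."""
    return sum(1 for _bits in product("01", repeat=n)
                 for _pair in combinations(range(n), 2))

N = 8
series = expand([0, 0, 4], [1, -6, 12, -8], N)     # 4z^2 / (1-2z)^3
print(series)   # [0, 0, 4, 24, 96, 320, 960, 2688]
assert series == [comb(n, 2) * 2**n for n in range(N)]
assert series == [brute_force(n) for n in range(N)]
```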


11.3.4.5 Substitution 


Suppose that the way to make a C-thing is to take a weight-$k$ A-thing and
attach to each of its $k$ items a B-thing, where the weight of the new C-thing is
the sum of the weights of the B-things. Then the generating function for C
is the composition $F(G(z))$.

Why this works: Suppose we just want to compute the number of C-things
of each weight that are made from some single specific weight-$k$ A-thing.
Then the generating function for this quantity is just $(G(z))^k$. If we expand
our horizons to include all $a_k$ weight-$k$ A-things, we have to multiply by $a_k$
to get $a_k (G(z))^k$. If we further expand our horizons to include A-things of
all different weights, we have to sum over all $k$:

\[ \sum_{k=0}^{\infty} a_k (G(z))^k. \]

But this is just what we get if we start with $F(z)$ and substitute $G(z)$
for each occurrence of $z$, i.e. if we compute $F(G(z))$.


Example: bit-strings with primes Suppose we let A be all sequences
of zeros and ones, with generating function $F(z) = 1/(1-2z)$. Now suppose
we can attach a single or double prime to each 0 or 1, giving 0′ or 0″ or
1′ or 1″, and we want a generating function for the number of distinct
primed bit-strings with $n$ attached primes. The set {′, ″} has generating
function $G(z) = z + z^2$, so the composite set has generating function $F(G(z)) =
1/(1 - 2(z + z^2)) = 1/(1 - 2z - 2z^2)$.
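Composition of truncated power series can be carried out with Horner's rule, which gives a way to test this example numerically. The sketch below is illustrative (helper names are made up for the example) and assumes the inner series has no constant term, so truncating after $z^{N-1}$ loses nothing.

```python
from itertools import product

def series_mul(f, g, n_terms):
    """Multiply two power series given as coefficient lists,
    keeping only the coefficients of z^0 .. z^(n_terms - 1)."""
    h = [0] * n_terms
    for i, a in enumerate(f[:n_terms]):
        for j, b in enumerate(g[:n_terms - i]):
            h[i + j] += a * b
    return h

def compose(f, g, n_terms):
    """Coefficients of F(G(z)), assuming g[0] == 0, via Horner's rule:
    F(G) = a_0 + G*(a_1 + G*(a_2 + ...))."""
    assert g[0] == 0
    result = [0] * n_terms
    for a in reversed(f[:n_terms]):
        result = series_mul(result, g, n_terms)
        result[0] += a
    return result

def brute_force(n):
    """Count bit strings whose bits each carry 1 or 2 primes, n primes in total."""
    total = 0
    for length in range(n + 1):
        for primes in product((1, 2), repeat=length):
            if sum(primes) == n:
                total += 2**length       # two choices of bit per position
    return total

N = 7
F = [2**k for k in range(N)]             # 1/(1-2z)
G = [0, 1, 1]                            # z + z^2
C = compose(F, G, N)
print(C)                                 # [1, 2, 6, 16, 44, 120, 328]
assert C == [brute_force(n) for n in range(N)]
```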


Example: (0|11)* again The previous example is a bit contrived. Here's
one that's a little more practical, although it involves a brief digression into
multivariate generating functions. A multivariate generating function
$F(x, y)$ generates a series $\sum_{i,j} a_{ij} x^i y^j$, where $a_{ij}$ counts the number of things
that have $i$ x's and $j$ y's. (There is also the obvious generalization to more
than two variables.) Consider the multivariate generating function for the
set $\{0, 1\}$, where $x$ counts zeros and $y$ counts ones: this is just $x + y$. The
multivariate generating function for sequences of zeros and ones is $1/(1 - x - y)$
by the repetition rule. Now suppose that each 0 is left intact but each 1 is
replaced by 11, and we want to count the total number of strings by length,
using $z$ as our series variable. So we substitute $z$ for $x$ and $z^2$ for $y$ (since
each $y$ turns into a string of length 2), giving $1/(1 - z - z^2)$. This gives
another way to get the generating function for strings built by repeating 0
and 11.


11.3.5 Generating functions and recurrences 


What makes generating functions particularly useful for algorithm analysis
is that they directly solve recurrences of the form $T(n) = aT(n-1) +
bT(n-2) + f(n)$ (or similar recurrences with more $T$ terms on the right-hand
side), provided we have a generating function $F(z)$ for $f(n)$. The idea is
that there exists some generating function $G(z)$ that describes the entire
sequence of values $T(0), T(1), T(2), \dots$, and we just need to solve for it
by restating the recurrence as an equation about $G$. The left-hand side
will just turn into $G$. For the right-hand side, we need to shift $T(n-1)$
and $T(n-2)$ to line up right, so that the right-hand side will correctly
represent the sequence $T(0), T(1), aT(1) + bT(0) + f(2)$, etc. It's not hard
to see that the generating function for the sequence $0, T(0), T(1), T(2), \dots$
(corresponding to the $T(n-1)$ term) is just $zG(z)$, and similarly the sequence
$0, 0, T(0), T(1), T(2), \dots$ (corresponding to the $T(n-2)$ term) is $z^2 G(z)$. So
we have (being very careful to subtract out extraneous terms for $i = 0$
and $i = 1$):

\[ G = az(G - T(0)) + bz^2 G + (F - f(0) - zf(1)) + T(0) + zT(1), \]

and after expanding $F$ we can in principle solve this for $G$ as a function
of $z$.


11.3.5.1 Example: A Fibonacci-like recurrence 


Let's take a concrete example. The Fibonacci-like recurrence

\[ T(n) = T(n-1) + T(n-2), \quad T(0) = 1, \quad T(1) = 1, \]

becomes

\[ G = (zG - z) + z^2 G + 1 + z \]

(here $F = 0$).
Solving for $G$ gives

\[ G = 1/(1 - z - z^2). \]

Unfortunately this is not something we recognize from our table, although
it has shown up in a couple of examples. (Exercise: Why does the recurrence
$T(n) = T(n-1) + T(n-2)$ count the number of strings built from 0 and
11 of length $n$?) In the next section we show how to recover a closed-form
expression for the coefficients of the resulting series.
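The general equation for $G$ can be tested numerically. Rearranging it gives $G(1 - az - bz^2) = T(0) + (T(1) - aT(0))z + \sum_{n \ge 2} f(n) z^n$, so $G$ is a quotient of two coefficient lists. The sketch below is an illustration under that rearrangement, with hypothetical function names; it expands the quotient by long division and compares the result with simply iterating the recurrence.

```python
def expand(num, den, n_terms):
    """First n_terms coefficients of num(z)/den(z), assuming den[0] == 1."""
    assert den[0] == 1
    out = []
    for k in range(n_terms):
        c = num[k] if k < len(num) else 0
        for j in range(1, min(k, len(den) - 1) + 1):
            c -= den[j] * out[k - j]
        out.append(c)
    return out

def solve_recurrence_series(a, b, t0, t1, f, n_terms):
    """Coefficients of G for T(n) = a*T(n-1) + b*T(n-2) + f(n), using
    G*(1 - a z - b z^2) = T(0) + (T(1) - a*T(0)) z + sum_{n>=2} f(n) z^n."""
    numerator = [t0, t1 - a * t0] + [f(n) for n in range(2, n_terms)]
    return expand(numerator, [1, -a, -b], n_terms)

def iterate(a, b, t0, t1, f, n_terms):
    """Compute T(0..n_terms-1) directly from the recurrence."""
    t = [t0, t1]
    for n in range(2, n_terms):
        t.append(a * t[n - 1] + b * t[n - 2] + f(n))
    return t[:n_terms]

fib_like = solve_recurrence_series(1, 1, 1, 1, lambda n: 0, 10)
print(fib_like)   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
assert fib_like == iterate(1, 1, 1, 1, lambda n: 0, 10)
```

For the Fibonacci-like case the numerator collapses to 1, matching $G = 1/(1 - z - z^2)$.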


11.3.6 Recovering coefficients from generating functions 


There are basically three ways to recover coefficients from generating func- 
tions: 


1. Recognize the generating function from a table of known generating
functions, or as a simple combination of such known generating
functions. This doesn't work very often but it is possible to get lucky.

2. To find the $k$-th coefficient of $F(z)$, compute the $k$-th derivative
$d^k/dz^k F(z)$ and divide by $k!$ to shift $a_k$ to the $z^0$ term. Then
substitute 0 for $z$. For example, if $F(z) = 1/(1-z)$ then $a_0 = 1$ (no
differentiating), $a_1 = 1/(1-0)^2 = 1$, $a_2 = 1/(1-0)^3 = 1$, etc. This
usually only works if the derivatives have a particularly nice form or
if you only care about the first couple of coefficients (it's particularly
effective if you only want $a_0$).


3. If the generating function is of the form $1/Q(z)$, where $Q$ is a polynomial
with $Q(0) \ne 0$, then it is generally possible to expand the generating
function out as a sum of terms of the form $P_c/(1 - z/c)$ where $c$ is a root
of $Q$ (i.e. a value such that $Q(c) = 0$). Each numerator $P_c$ will be a
constant if $c$ is not a repeated root; if $c$ is a repeated root, then $P_c$ can be
a polynomial of degree up to one less than the multiplicity of $c$. We like
these expanded solutions because we recognize $1/(1 - z/c) = \sum_i c^{-i} z^i$,
and so we can read off the coefficients $a_i$ generated by $1/Q(z)$ as an
appropriately weighted sum of $c_1^{-i}$, $c_2^{-i}$, etc., where the $c_j$ range over
the roots of $Q$.


Example: Take the generating function $G = 1/(1 - z - z^2)$. We can
simplify it by factoring the denominator: $1 - z - z^2 = (1 - az)(1 - bz)$ where
$1/a$ and $1/b$ are the solutions to the equation $1 - z - z^2 = 0$; in this case
$a = (1 + \sqrt{5})/2$, which is approximately 1.618, and $b = (1 - \sqrt{5})/2$, which is
approximately $-0.618$. It happens to be the case that we can always expand
$1/P(z)$ as $A/(1 - az) + B/(1 - bz)$ for some constants $A$ and $B$ whenever
$P$ is a degree 2 polynomial with constant coefficient 1 and distinct roots $1/a$
and $1/b$, so

\[ G = \frac{A}{1 - az} + \frac{B}{1 - bz}, \]

and here we can recognize the right-hand side as the sum of the generating
functions for the sequences $A \cdot a^i$ and $B \cdot b^i$. The $A \cdot a^i$ term dominates, so
we have that $T(n) = O(a^n)$, where $a$ is approximately 1.618. We can also
solve for $A$ and $B$ exactly to find an exact solution if desired.

A rule of thumb that applies to recurrences of the form $T(n) = a_1 T(n-1)
+ a_2 T(n-2) + \dots + a_k T(n-k) + f(n)$ is that unless $f$ is particularly large,
the solution is usually exponential in $1/x$, where $x$ is the smallest root (in
absolute value) of the polynomial $1 - a_1 z - a_2 z^2 - \dots - a_k z^k$. This can be
used to get very quick estimates of the solutions to such recurrences (which
can then be proved without fooling around with generating functions).

Exercise: What is the exact solution if $T(n) = T(n-1) + T(n-2) + 1$?
Or if $T(n) = T(n-1) + T(n-2) + n$?


11.3.6.1 Partial fraction expansion and Heaviside's cover-up method


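There is a nice trick for finding the numerators in a partial fraction expansion.
Suppose we have

\[ \frac{1}{(1 - az)(1 - bz)} = \frac{A}{1 - az} + \frac{B}{1 - bz}. \]

Multiply both sides by $1 - az$ to get

\[ \frac{1}{1 - bz} = A + \frac{B(1 - az)}{1 - bz}. \]

Now plug in $z = 1/a$ to get

\[ \frac{1}{1 - b/a} = A + 0. \]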


We can immediately read off $A$. Similarly, multiplying by $1 - bz$ and
then setting $1 - bz$ to zero gets $B$. The method is known as the "cover-up
method" because multiplication by $1 - az$ can be simulated by covering up
$1 - az$ in the denominator of the left-hand side and all the terms that don't
have $1 - az$ in the denominator on the right-hand side.

The cover-up method will work in general whenever there are no repeated
roots, even if there are many of them; the idea is that setting $1 - qz$ to zero
knocks out all the terms on the right-hand side but one. With repeated roots
we have to worry about getting numerators that aren't just a constant, so
things get more complicated. We'll come back to this case below.
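For distinct roots the cover-up method is mechanical enough to write down in a few lines. The sketch below is an illustration, not a general partial-fraction routine: it assumes the denominator is already factored as $\prod_j (1 - c_j z)$ with distinct nonzero $c_j$, and the names `cover_up` and `coefficient` are made up for this example. Exact rational arithmetic via `fractions.Fraction` avoids rounding.

```python
from fractions import Fraction

def cover_up(cs):
    """Numerators A_j in 1/prod_j (1 - c_j z) = sum_j A_j / (1 - c_j z),
    assuming the c_j are distinct and nonzero."""
    cs = [Fraction(c) for c in cs]
    numerators = []
    for j, cj in enumerate(cs):
        a = Fraction(1)
        for i, ci in enumerate(cs):
            if i != j:
                a /= 1 - ci / cj     # remaining factor evaluated at z = 1/c_j
        numerators.append(a)
    return numerators

def coefficient(cs, n):
    """Coefficient of z^n, read off as sum_j A_j * c_j**n."""
    return sum(a * Fraction(c)**n for a, c in zip(cover_up(cs), cs))

print(cover_up([1, 2]))   # [Fraction(-1, 1), Fraction(2, 1)]
# 1/((1-z)(1-2z)) = -1/(1-z) + 2/(1-2z), so the n-th coefficient is 2^(n+1) - 1.
assert [coefficient([1, 2], n) for n in range(6)] == [1, 3, 7, 15, 31, 63]
```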


Example: A simple recurrence Suppose $f(0) = 0$, $f(1) = 1$, and for
$n \ge 2$, $f(n) = f(n-1) + 2f(n-2)$. Multiplying these equations by $z^n$ and
summing over all $n$ gives a generating function

\[ F(z) = \sum_{n=0}^{\infty} f(n) z^n = 0 \cdot z^0 + 1 \cdot z^1 + \sum_{n=2}^{\infty} f(n-1) z^n + 2 \sum_{n=2}^{\infty} f(n-2) z^n. \]

With a bit of tweaking, we can get rid of the sums on the RHS by
converting them into copies of $F$:

\begin{align*}
F(z) &= z + \sum_{n=2}^{\infty} f(n-1) z^n + 2 \sum_{n=2}^{\infty} f(n-2) z^n \\
&= z + \sum_{n=1}^{\infty} f(n) z^{n+1} + 2 \sum_{n=0}^{\infty} f(n) z^{n+2} \\
&= z + z \sum_{n=1}^{\infty} f(n) z^n + 2z^2 \sum_{n=0}^{\infty} f(n) z^n \\
&= z + z(F(z) - f(0) z^0) + 2z^2 F(z) \\
&= z + zF(z) + 2z^2 F(z).
\end{align*}

Now solve for $F(z)$ to get $F(z) = \frac{z}{1 - z - 2z^2} = \frac{z}{(1+z)(1-2z)} = z \left( \frac{A}{1+z} + \frac{B}{1-2z} \right)$,
where we need to solve for $A$ and $B$.

We can do this directly, or we can use the cover-up method. The
cover-up method is easier. Setting $z = -1$ and covering up $1 + z$ gives
$A = 1/(1 - 2(-1)) = 1/3$. Setting $z = 1/2$ and covering up $1 - 2z$ gives
$B = 1/(1 + z) = 1/(1 + 1/2) = 2/3$. So we have

\begin{align*}
F(z) &= \frac{(1/3)z}{1+z} + \frac{(2/3)z}{1-2z} \\
&= \sum_{n=0}^{\infty} \frac{(-1)^n}{3} z^{n+1} + \sum_{n=0}^{\infty} \frac{2 \cdot 2^n}{3} z^{n+1} \\
&= \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{3} z^n + \sum_{n=1}^{\infty} \frac{2^n}{3} z^n \\
&= \sum_{n=1}^{\infty} \left( \frac{2^n - (-1)^n}{3} \right) z^n.
\end{align*}

This gives $f(0) = 0$ and, for $n \ge 1$, $f(n) = \frac{2^n - (-1)^n}{3}$. It's not hard to
check that this gives the same answer as the recurrence.
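That check takes only a few lines. This sketch (with illustrative function names) iterates the recurrence and compares it with the closed form; integer division is exact here because $2^n - (-1)^n$ is always a multiple of 3.

```python
def f_recurrence(n_terms):
    """f(0) = 0, f(1) = 1, f(n) = f(n-1) + 2 f(n-2)."""
    f = [0, 1]
    for n in range(2, n_terms):
        f.append(f[n - 1] + 2 * f[n - 2])
    return f[:n_terms]

def f_closed(n):
    """Closed form (2^n - (-1)^n) / 3 read off from the generating function."""
    return (2**n - (-1)**n) // 3

print(f_recurrence(10))   # [0, 1, 1, 3, 5, 11, 21, 43, 85, 171]
assert f_recurrence(10) == [f_closed(n) for n in range(10)]
```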


Example: Coughing cows Let's count the number of strings of each
length of the form (M)*(O|U)*(G|H|K)* where (x|y) means we can use x
or y and * means we can repeat the previous parenthesized expression 0 or
more times (these are examples of regular expressions).

We start with a sequence of 0 or more M's. The generating function for
this part is our old friend $1/(1-z)$. For the second part, we have two choices
for each letter, giving $1/(1-2z)$. For the third part, we have $1/(1-3z)$.
Since each part can be chosen independently of the other two, the generating
function for all three parts together is just the product:

\[ \frac{1}{(1-z)(1-2z)(1-3z)}. \]

Let's use the cover-up method to convert this to a sum of partial fractions.
We have

\begin{align*}
\frac{1}{(1-z)(1-2z)(1-3z)}
&= \frac{\left( \frac{1}{(1-2)(1-3)} \right)}{1-z}
 + \frac{\left( \frac{1}{(1-\frac{1}{2})(1-\frac{3}{2})} \right)}{1-2z}
 + \frac{\left( \frac{1}{(1-\frac{1}{3})(1-\frac{2}{3})} \right)}{1-3z} \\
&= \frac{1/2}{1-z} + \frac{-4}{1-2z} + \frac{9/2}{1-3z}.
\end{align*}

So the exact number of length-$n$ sequences is $(1/2) - 4 \cdot 2^n + (9/2) \cdot 3^n$.
We can check this for small $n$:
We can check this for small n: 


n   Formula                    Strings

0   1/2 - 4 + 9/2 = 1          ()

1   1/2 - 8 + 27/2 = 6         M, O, U, G, H, K

2   1/2 - 16 + 81/2 = 25       MM, MO, MU, MG, MH, MK, OO, OU, OG, OH, OK, UO,
                               UU, UG, UH, UK, GG, GH, GK, HG, HH, HK, KG, KH, KK

3   1/2 - 32 + 243/2 = 90      (exercise)
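The remaining rows can be filled in by machine. The following sketch is only an illustration: it enumerates all strings over the six letters, keeps those matching the pattern, and compares the count with the closed formula evaluated in exact rational arithmetic.

```python
import re
from fractions import Fraction
from itertools import product

def formula(n):
    """(1/2) - 4 * 2^n + (9/2) * 3^n, evaluated exactly."""
    return Fraction(1, 2) - 4 * 2**n + Fraction(9, 2) * 3**n

def brute_force(n):
    """Count length-n strings of the form M* (O|U)* (G|H|K)*."""
    shape = re.compile(r"M*[OU]*[GHK]*")
    return sum(1 for letters in product("MOUGHK", repeat=n)
               if shape.fullmatch("".join(letters)))

for n in range(5):
    assert formula(n) == brute_force(n)
    print(n, formula(n))
```

The loop also settles the exercise row for $n = 3$ and goes one step further, at the cost of scanning $6^n$ candidate strings.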


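Example: A messy recurrence Let's try to solve the recurrence $T(n) =
4T(n-1) + 12T(n-2) + 1$ with $T(0) = 0$ and $T(1) = 1$.
Let $F = \sum T(n) z^n$.

Summing over all $n$ gives

\begin{align*}
F = \sum_{n=0}^{\infty} T(n) z^n &= T(0) z^0 + T(1) z^1 + 4 \sum_{n=2}^{\infty} T(n-1) z^n + 12 \sum_{n=2}^{\infty} T(n-2) z^n + \sum_{n=2}^{\infty} 1 \cdot z^n \\
&= z + 4z \sum_{n=1}^{\infty} T(n) z^n + 12 z^2 \sum_{n=0}^{\infty} T(n) z^n + z^2 \sum_{n=0}^{\infty} z^n \\
&= z + 4z(F - T(0)) + 12 z^2 F + \frac{z^2}{1-z} \\
&= z + 4zF + 12 z^2 F + \frac{z^2}{1-z}.
\end{align*}

Solving for $F$ gives

\[ F = \frac{z + \frac{z^2}{1-z}}{1 - 4z - 12z^2}. \]

We want to solve this using partial fractions, so we need to factor $(1 -
4z - 12z^2) = (1 + 2z)(1 - 6z)$. This gives

\begin{align*}
F &= \frac{z + \frac{z^2}{1-z}}{(1+2z)(1-6z)} \\
&= \frac{z}{(1+2z)(1-6z)} + \frac{z^2}{(1-z)(1+2z)(1-6z)} \\
&= z \left( \frac{1}{(1+2z)\left(1 - 6\left(-\frac{1}{2}\right)\right)} + \frac{1}{\left(1 + 2\left(\frac{1}{6}\right)\right)(1-6z)} \right) \\
&\quad + z^2 \left( \frac{1}{(1-z)(1+2)(1-6)} + \frac{1}{\left(1 - \left(-\frac{1}{2}\right)\right)(1+2z)\left(1 - 6\left(-\frac{1}{2}\right)\right)} + \frac{1}{\left(1 - \frac{1}{6}\right)\left(1 + 2\left(\frac{1}{6}\right)\right)(1-6z)} \right) \\
&= \frac{\frac{1}{4} z}{1+2z} + \frac{\frac{3}{4} z}{1-6z} - \frac{\frac{1}{15} z^2}{1-z} + \frac{\frac{1}{6} z^2}{1+2z} + \frac{\frac{9}{10} z^2}{1-6z}.
\end{align*}

From this we can immediately read off the value of $T(n)$ for $n \ge 2$:

\begin{align*}
T(n) &= \frac{1}{4} (-2)^{n-1} + \frac{3}{4} 6^{n-1} - \frac{1}{15} + \frac{1}{6} (-2)^{n-2} + \frac{9}{10} 6^{n-2} \\
&= -\frac{1}{8} (-2)^n + \frac{1}{8} 6^n - \frac{1}{15} + \frac{1}{24} (-2)^n + \frac{1}{40} 6^n \\
&= \frac{3}{20} 6^n - \frac{1}{12} (-2)^n - \frac{1}{15}.
\end{align*}

Let's check this against the solutions we get from the recurrence itself:

n   T(n)

0   0

1   1

2   1 + 4·1 + 12·0 = 5

3   1 + 4·5 + 12·1 = 33

4   1 + 4·33 + 12·5 = 193

We'll try $n = 3$, and get $T(3) = (3/20) \cdot 216 + 8/12 - 1/15 = (3 \cdot 3 \cdot 216 +
40 - 4)/60 = (1944 + 40 - 4)/60 = 1980/60 = 33$.

To be extra safe, let's try $T(2) = (3/20) \cdot 36 - 4/12 - 1/15 = (3 \cdot 3 \cdot 36 -
20 - 4)/60 = (324 - 20 - 4)/60 = 300/60 = 5$. This looks good too.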
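The same comparison can be automated with exact rational arithmetic, which removes the risk of slips in the hand calculation. This is a sketch with illustrative function names; it also shows that the formula happens to agree with the recurrence at $n = 0$ and $n = 1$, not just for $n \ge 2$.

```python
from fractions import Fraction

def t_closed(n):
    """(3/20) 6^n - (1/12) (-2)^n - 1/15, evaluated exactly."""
    return (Fraction(3, 20) * 6**n
            - Fraction(1, 12) * (-2)**n
            - Fraction(1, 15))

def t_recurrence(n_terms):
    """T(0) = 0, T(1) = 1, T(n) = 4 T(n-1) + 12 T(n-2) + 1."""
    t = [0, 1]
    for n in range(2, n_terms):
        t.append(4 * t[n - 1] + 12 * t[n - 2] + 1)
    return t[:n_terms]

print(t_recurrence(6))   # [0, 1, 5, 33, 193, 1169]
assert t_recurrence(12) == [t_closed(n) for n in range(12)]
```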

The moral of this exercise? Generating functions can solve ugly-looking
recurrences exactly, but you have to be very very careful in doing the math.


11.3.6.2 Partial fraction expansion with repeated roots


Let $a_n = 2a_{n-1} + n$, with some constant $a_0$. We'd like to find a closed-form
formula for $a_n$.
As a test, let's figure out the first few terms of the sequence:

\begin{align*}
a_0 &= a_0 \\
a_1 &= 2a_0 + 1 \\
a_2 &= 4a_0 + 2 + 2 = 4a_0 + 4 \\
a_3 &= 8a_0 + 8 + 3 = 8a_0 + 11 \\
a_4 &= 16a_0 + 22 + 4 = 16a_0 + 26
\end{align*}

The $a_0$ terms look nice (they're $2^n a_0$), but the 0, 1, 4, 11, 26 sequence
doesn't look like anything familiar. So we'll find the formula the hard way.

First we convert the recurrence into an equation over generating functions
and solve for the generating function $F$:

\begin{align*}
\sum a_n z^n &= 2 \sum a_{n-1} z^n + \sum n z^n + a_0 \\
F &= 2zF + \frac{z}{(1-z)^2} + a_0 \\
(1 - 2z) F &= \frac{z}{(1-z)^2} + a_0 \\
F &= \frac{z}{(1-z)^2 (1-2z)} + \frac{a_0}{1-2z}.
\end{align*}


Observe that the right-hand term gives us exactly the $2^n a_0$ terms we
expected, since $1/(1-2z)$ generates the sequence $2^n$. But what about the
left-hand term? Here we need to apply a partial-fraction expansion, which is
simplified because we already know how to factor the denominator but is
complicated because there is a repeated root.

We can now proceed in one of two ways: we can solve directly for the
partial fraction expansion, or we can use an extended version of Heaviside's
cover-up method that handles repeated roots using differentiation. We'll
start with the direct method.


Solving for the PFE directly Write

\[ \frac{1}{(1-z)^2 (1-2z)} = \frac{A}{(1-z)^2} + \frac{B}{1-2z}. \]

We expect $B$ to be a constant and $A$ to be of the form $A_1 z + A_0$.

To find $B$, use the technique of multiplying by $1 - 2z$ and setting $z = 1/2$:

\[ \frac{1}{(1 - \frac{1}{2})^2} = \frac{A \cdot 0}{(1 - \frac{1}{2})^2} + B. \]

So $B = 1/(1 - 1/2)^2 = 1/(1/4) = 4$.
We can't do this for $A$, but we can solve for it after substituting in $B = 4$:

\begin{align*}
\frac{1}{(1-z)^2 (1-2z)} &= \frac{A}{(1-z)^2} + \frac{4}{1-2z} \\
1 &= A(1-2z) + 4(1-z)^2 \\
A &= \frac{1 - 4(1-z)^2}{1-2z} \\
&= \frac{1 - 4 + 8z - 4z^2}{1-2z} \\
&= \frac{-3 + 8z - 4z^2}{1-2z} \\
&= \frac{-(1-2z)(3-2z)}{1-2z} \\
&= 2z - 3.
\end{align*}


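So we have the expansion

\[ \frac{1}{(1-z)^2 (1-2z)} = \frac{2z - 3}{(1-z)^2} + \frac{4}{1-2z}, \]

from which we get

\begin{align*}
F &= \frac{z}{(1-z)^2 (1-2z)} + \frac{a_0}{1-2z} \\
&= \frac{2z^2 - 3z}{(1-z)^2} + \frac{4z}{1-2z} + \frac{a_0}{1-2z}.
\end{align*}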


If we remember that 1/(1— z)? generates the sequence rz, =n +1 and 
1/(1 — 2z) generates x, = 2”, then we can quickly read off the solution (for 


large n): 
Gg = 21) = Baa ag 2S 9 S20 =y 


which we can check by plugging in particular values of $n$ and comparing
the result to the values we got by iterating the recurrence before.

The reason for the “large $n$” caveat is that $z^2/(1-z)^2$ doesn’t generate
precisely the sequence $x_n = n-1$, since it takes on the values $0, 0, 1, 2, 3, 4, \ldots$
instead of $-1, 0, 1, 2, 3, 4, \ldots$. Similarly, the power series for $z/(1-2z)$ does
not have the coefficient $2^{n-1} = 1/2$ when $n = 0$. Miraculously, in this
particular example the formula works for $n = 0$, even though it shouldn’t:
$2(n-1)$ is $-2$ instead of $0$, but $4 \cdot 2^{n-1}$ is $2$ instead of $0$, and the two errors
cancel each other out.
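
The check against iterated values can be sketched in a few lines of Python. This is only an illustration: it assumes the recurrence being solved is $a_n = 2a_{n-1} + n$ (the recurrence implied by the generating function $F = z/((1-z)^2(1-2z)) + a_0/(1-2z)$), and the function names `iterate` and `closed_form` are made up for this example.

```python
def iterate(a0, n_max):
    """Return [a_0, ..., a_{n_max}] for the assumed recurrence a_n = 2*a_{n-1} + n."""
    seq = [a0]
    for n in range(1, n_max + 1):
        seq.append(2 * seq[-1] + n)
    return seq


def closed_form(a0, n):
    """The formula read off from the partial fraction expansion."""
    return 2 ** (n + 1) + 2 ** n * a0 - n - 2


# Compare the two for a few starting values, including n = 0.
for a0 in (0, 1, 5):
    seq = iterate(a0, 10)
    assert all(seq[n] == closed_form(a0, n) for n in range(11))
```

Note that the comparison includes $n = 0$, where the two off-by-one errors discussed above cancel.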


Solving for the PFE using the extended cover-up method  It is also
possible to extend the cover-up method to handle repeated roots. Here we
choose a slightly different form of the partial fraction expansion:

\[
\frac{1}{(1-z)^2(1-2z)} = \frac{A}{(1-z)^2} + \frac{B}{1-z} + \frac{C}{1-2z}.
\]

Here $A$, $B$, and $C$ are all constants. We can get $A$ and $C$ by the cover-up
method, where for $A$ we multiply both sides by $(1-z)^2$ before setting $z = 1$;
this gives $A = 1/(1-2) = -1$ and $C = 1/(1-\frac{1}{2})^2 = 4$. For $B$, if we multiply
both sides by $(1-z)$ we are left with $A/(1-z)$ on the right-hand side and a
$(1-z)$ in the denominator on the left-hand side. Clearly setting $z = 1$ in
this case will not help us.

The solution is to first multiply by (1 — z)? as before but then take a 
derivative: 


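\[
\begin{aligned}
\frac{1}{(1-z)^2(1-2z)} &= \frac{A}{(1-z)^2} + \frac{B}{1-z} + \frac{C}{1-2z} \\
\frac{1}{1-2z} &= A + (1-z)B + \frac{C(1-z)^2}{1-2z} \\
\frac{d}{dz} \frac{1}{1-2z} &= \frac{d}{dz} \left( A + (1-z)B + \frac{C(1-z)^2}{1-2z} \right) \\
\frac{2}{(1-2z)^2} &= -B + \frac{-2C(1-z)}{1-2z} + \frac{2C(1-z)^2}{(1-2z)^2}.
\end{aligned}
\]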


Now if we set $z = 1$, every term on the right-hand side except $-B$
becomes 0, and we get $-B = 2/(1-2)^2$ or $B = -2$.
Plugging $A$, $B$, and $C$ into our original formula gives

\[
\frac{1}{(1-z)^2(1-2z)} = \frac{-1}{(1-z)^2} + \frac{-2}{1-z} + \frac{4}{1-2z},
\]


and thus 


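\[
F = \frac{z}{(1-z)^2(1-2z)} + \frac{a_0}{1-2z}
  = z \left( \frac{-1}{(1-z)^2} + \frac{-2}{1-z} + \frac{4}{1-2z} \right) + \frac{a_0}{1-2z}.
\]

From this we can read off (for large $n$):

\[
a_n = 4 \cdot 2^{n-1} - n - 2 + a_0 \cdot 2^n = 2^{n+1} + 2^n a_0 - n - 2.
\]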


We believe this because it looks like the solution we already got. 
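
For a less faith-based check, both expansions can be tested with exact rational arithmetic. The sketch below (function names are illustrative, not from the notes) evaluates each side at a handful of rational points away from the poles $z = 1$ and $z = 1/2$. After clearing denominators both sides are polynomials of degree at most 2, so agreement at three or more points already forces them to be identical.

```python
from fractions import Fraction


def original(z):
    return 1 / ((1 - z) ** 2 * (1 - 2 * z))


def direct_form(z):
    # A = 2z - 3 over (1 - z)^2, plus B = 4 over (1 - 2z).
    return (2 * z - 3) / (1 - z) ** 2 + 4 / (1 - 2 * z)


def cover_up_form(z, a=-1, b=-2, c=4):
    # Constants A, B, C from the extended cover-up method.
    return a / (1 - z) ** 2 + b / (1 - z) + c / (1 - 2 * z)


points = [Fraction(p, q) for p, q in [(0, 1), (1, 3), (-2, 5), (3, 1), (7, 4)]]
assert all(original(z) == direct_form(z) == cover_up_form(z) for z in points)
```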


11.3.7 Asymptotic estimates 


We can simplify our life considerably if we only want an asymptotic estimate
of $a_n$ (see Chapter 7). The basic idea is that if $a_n$ is non-negative for
sufficiently large $n$ and $\sum a_n z^n$ converges for some fixed value $z$, then $a_n$
must be $o(z^{-n})$ in the limit. (Proof: otherwise, $a_n z^n$ is at least a constant
for infinitely many $n$, giving a divergent sum.) So we can use the radius
of convergence of a generating function $F(z)$, defined as the largest value
$r$ such that $F(z)$ is defined for all (complex) $z$ with $|z| < r$, to get a quick
estimate of the growth rate of $F$’s coefficients: whatever they do, we have
$a_n = O((r-\epsilon)^{-n})$ for every $\epsilon > 0$, which is to say $a_n$ grows no faster than
$r^{-n}$ up to subexponential factors.

For generating functions that are rational functions (ratios of polynomials),
we can use the partial fraction expansion to do even better. First
observe that for $F(z) = \sum f_n z^n = 1/(1-az)^k$, we have

\[
f_n = \binom{k+n-1}{n} a^n = \frac{(n+k-1)(n+k-2) \cdots (n+1)}{(k-1)!} a^n = \Theta(a^n n^{k-1}).
\]

Second, observe that the numerator is irrelevant: if $1/(1-az)^k = \Theta(a^n n^{k-1})$ then
$bz^m/(1-az)^k = b\,\Theta(a^{n-m}(n-m)^{k-1}) = b a^{-m} (1-m/n)^{k-1}\,\Theta(a^n n^{k-1}) = \Theta(a^n n^{k-1})$, because
everything outside the $\Theta$ disappears into the constant for sufficiently large $n$.
Finally, observe that in a partial fraction expansion, the term $1/(1-az)^k$
with the largest coefficient $a$ (if there is one) wins in the resulting asymptotic
sum: $\Theta(a^n) + \Theta(b^n) = \Theta(a^n)$ if $|a| > |b|$. So we have:


Theorem 11.3.1. Let $F(z) = \sum f_n z^n = P(z)/Q(z)$ where $P$ and $Q$ are
polynomials in $z$. If $Q$ has a root $r$ with multiplicity $k$, and all other roots $s$
of $Q$ satisfy $|r| < |s|$, then $f_n = \Theta((1/r)^n n^{k-1})$.


The requirement that $r$ is a unique minimal root of $Q$ is necessary; for
example, $F(z) = 2/(1-z^2) = 1/(1-z) + 1/(1+z)$ generates the sequence
$2, 0, 2, 0, \ldots$, which is not $\Theta(1)$ because of all the zeros; here the problem is
that $1 - z^2$ has two roots with the same absolute value, so for some values of
$n$ it is possible for them to cancel each other out.

A root in the denominator of a rational function $F$ is called a pole.
So another way to state the theorem is that the asymptotic value of the
coefficients of a rational generating function is determined by the smallest
pole.

More examples:


F(z)                                   Smallest pole                          Asymptotic value
$1/(1-z)$                              $1$                                    $\Theta(1)$
$1/(1-z)^2$                            $1$, multiplicity 2                    $\Theta(n)$
$1/(1-z-z^2)$                          $(\sqrt{5}-1)/2 = 2/(1+\sqrt{5})$      $\Theta(((1+\sqrt{5})/2)^n)$
$1/((1-z)(1-2z)(1-3z))$                $1/3$                                  $\Theta(3^n)$
$(z+z^2(1-z))/(1-4z-12z^2)$            $1/6$                                  $\Theta(6^n)$
$1/((1-z)^2(1-2z))$                    $1/2$                                  $\Theta(2^n)$


In each case it may be instructive to compare the asymptotic values to 
the exact values we obtained earlier. 
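
One way to make that comparison without doing the algebra is to expand the rational function as a power series numerically. The following sketch uses the identity $Q \cdot F = P$ to get the recurrence $f_n = (p_n - \sum_{j \ge 1} q_j f_{n-j})/q_0$; the helper name `series_coefficients` and the choice of which rows to test are assumptions made for this illustration.

```python
from fractions import Fraction


def series_coefficients(p, q, count):
    """First `count` power-series coefficients of P(z)/Q(z).

    p and q list polynomial coefficients in increasing powers of z; q[0] != 0.
    """
    coeffs = []
    for n in range(count):
        total = Fraction(p[n]) if n < len(p) else Fraction(0)
        for j in range(1, min(n, len(q) - 1) + 1):
            total -= q[j] * coeffs[n - j]
        coeffs.append(total / q[0])
    return coeffs


# Row 3 of the table: 1/(1 - z - z^2), expected Theta(((1 + sqrt 5)/2)^n).
fib_like = series_coefficients([1], [1, -1, -1], 40)
golden = (1 + 5 ** 0.5) / 2
fib_ratio = float(fib_like[39]) / golden ** 39

# Last row: 1/((1-z)^2 (1-2z)) with denominator 1 - 4z + 5z^2 - 2z^3,
# expected Theta(2^n).
last_row = series_coefficients([1], [1, -4, 5, -2], 31)
last_ratio = float(last_row[30]) / 2 ** 30

print(fib_ratio, last_ratio)  # both ratios settle down to constants
```

If the theorem applies, each ratio $f_n / (1/r)^n n^{k-1}$ should approach a nonzero constant rather than growing or shrinking.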


11.3.8 Recovering the sum of all coefficients 


Given a generating function for a convergent series $\sum_i a_i z^i$, we can compute
the sum of all the $a_i$ by setting $z$ to 1. Unfortunately, for many common
generating functions setting $z = 1$ yields 0/0 (if it yields something else
divided by zero then the series diverges). In this case we can recover the
correct sum by taking the limit as $z$ goes to 1 using L’Hôpital’s rule, which
says that $\lim_{x \to c} f(x)/g(x) = \lim_{x \to c} f'(x)/g'(x)$ when the latter limit exists
and either $f(c) = g(c) = 0$ or $f(c) = g(c) = \infty$.*


11.3.8.1 Example 


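Let’s derive the formula for $1 + 2 + \cdots + n$. We’ll start with the generating
function for the series $\sum_{i=0}^{n} z^i$, which is $(1 - z^{n+1})/(1 - z)$. Applying the
$z \frac{d}{dz}$ method gives us

\[
\begin{aligned}
\sum_{i=0}^{n} i z^i &= z \frac{d}{dz} \frac{1 - z^{n+1}}{1 - z} \\
  &= z \left( \frac{1}{(1-z)^2} - \frac{(n+1) z^n}{1-z} - \frac{z^{n+1}}{(1-z)^2} \right) \\
  &= \frac{z - (n+1) z^{n+1} + n z^{n+2}}{(1-z)^2}.
\end{aligned}
\]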


* The justification for doing this is that we know that a finite sequence really has
a finite sum, so the “singularity” appearing at $z = 1$ in e.g. $\frac{1 - z^{n+1}}{1 - z}$ is an artifact of
the generating-function representation rather than the original series: it’s a “removable
singularity” that can be replaced by the limit of $f(x)/g(x)$ as $x \to c$.

Plugging $z = 1$ into this expression gives $(1 - (n+1) + n)/(1 - 1) = 0/0$,
which does not make us happy. So we go to the hospital, and twice at that, since one
application of L’Hôpital’s rule doesn’t get rid of our 0/0 problem:

\[
\begin{aligned}
\lim_{z \to 1} \frac{z - (n+1) z^{n+1} + n z^{n+2}}{(1-z)^2}
  &= \lim_{z \to 1} \frac{1 - (n+1)^2 z^n + n(n+2) z^{n+1}}{-2(1-z)} \\
  &= \lim_{z \to 1} \frac{-n(n+1)^2 z^{n-1} + n(n+1)(n+2) z^n}{2} \\
  &= \frac{-n(n+1)^2 + n(n+1)(n+2)}{2} \\
  &= \frac{-n^3 - 2n^2 - n + n^3 + 3n^2 + 2n}{2} \\
  &= \frac{n^2 + n}{2} = \frac{n(n+1)}{2},
\end{aligned}
\]


which is our usual formula. Gauss’s childhood proof is a lot quicker, but the
generating-function proof is one where we could in principle automate
most of the work using a computer algebra system, and it doesn’t require
much creativity or intelligence. So it might be the weapon of choice for
nastier problems where no clever proof comes to mind.

More examples of this technique can be found in §11.2, where the binomial
theorem applied to $(1 + x)^n$ (which is really just a generating function for
$\sum_i \binom{n}{i} z^i$) is used to add up various sums of binomial coefficients.
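
Short of a full computer algebra system, the derivation of $1 + 2 + \cdots + n$ can at least be spot-checked with exact arithmetic. The sketch below (names are illustrative) compares the closed form for $\sum i z^i$ against the sum itself at rational points $z \ne 1$, and evaluates the twice-differentiated quotient at $z = 1$.

```python
from fractions import Fraction


def weighted_sum(n, z):
    """The finite sum 0*z^0 + 1*z^1 + ... + n*z^n, computed term by term."""
    return sum(i * z ** i for i in range(n + 1))


def closed_form(n, z):
    """The result of the z d/dz method; only valid for z != 1."""
    return (z - (n + 1) * z ** (n + 1) + n * z ** (n + 2)) / (1 - z) ** 2


def lhopital_limit(n):
    """Second derivatives of numerator and denominator, evaluated at z = 1."""
    numerator = -n * (n + 1) ** 2 + n * (n + 1) * (n + 2)
    return Fraction(numerator, 2)


for n in range(8):
    for z in (Fraction(1, 2), Fraction(-3), Fraction(5, 7)):
        assert weighted_sum(n, z) == closed_form(n, z)
    assert lhopital_limit(n) == weighted_sum(n, 1) == Fraction(n * (n + 1), 2)
```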


11.3.9 A recursive generating function 


Let’s suppose we want to count binary trees with $n$ internal nodes. We can
obtain such a tree either by (a) choosing an empty tree (g.f.: $z^0 = 1$); or
(b) choosing a root with weight 1 (g.f. $1 \cdot z^1 = z$, since we can choose it in
exactly one way), and two subtrees (g.f. $= F^2$ where $F$ is the g.f. for trees).
This gives us a recursive definition

\[
F = 1 + zF^2.
\]

Solving for $F$ using the quadratic formula gives

\[
F = \frac{1 \pm \sqrt{1 - 4z}}{2z}.
\]


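That $2z$ in the denominator may cause us trouble later, but let’s worry
about that when the time comes. First we need to figure out how to extract
coefficients from the square root term.

The binomial theorem says

\[
(1 - 4z)^{1/2} = \sum_{n=0}^{\infty} \binom{1/2}{n} (-4z)^n.
\]

For $n \ge 1$, we can expand out the $\binom{1/2}{n}$ terms as

\[
\begin{aligned}
\binom{1/2}{n} &= \frac{(1/2)_n}{n!} \\
  &= \frac{1}{n!} \prod_{k=0}^{n-1} \left( \frac{1}{2} - k \right) \\
  &= \frac{1}{n!} \prod_{k=0}^{n-1} \frac{1 - 2k}{2} \\
  &= \frac{(-1)^{n-1}}{2^n n!} \prod_{k=1}^{n-1} (2k - 1) \\
  &= \frac{(-1)^{n-1}}{2^n n!} \cdot \frac{(2n-2)!}{\prod_{k=1}^{n-1} 2k} \\
  &= \frac{(-1)^{n-1}}{2^n n!} \cdot \frac{(2n-2)!}{2^{n-1} (n-1)!} \\
  &= \frac{(-1)^{n-1}}{2^{2n-1}} \cdot \frac{(2n-2)!}{n! (n-1)!} \\
  &= \frac{(-1)^{n-1}}{2^{2n-1} (2n-1)} \cdot \frac{(2n-1)!}{n! (n-1)!} \\
  &= \frac{(-1)^{n-1}}{2^{2n-1} (2n-1)} \binom{2n-1}{n}.
\end{aligned}
\]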


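For $n = 0$, the switch from the big product of odd terms to $(2n-2)!$
divided by the even terms doesn’t work, because $(2n-2)!$ is undefined. So
here we just use the special case $\binom{1/2}{0} = 1$.

Now plug this nasty expression back into $F$ to get

\[
\begin{aligned}
F &= \frac{1 \pm \sqrt{1 - 4z}}{2z} \\
  &= \frac{1}{2z} \pm \frac{1}{2z} \sum_{n=0}^{\infty} \binom{1/2}{n} (-4z)^n \\
  &= \frac{1}{2z} \pm \frac{1}{2z} \left( 1 + \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{2^{2n-1} (2n-1)} \binom{2n-1}{n} (-4z)^n \right) \\
  &= \frac{1}{2z} \pm \frac{1}{2z} \left( 1 + \sum_{n=1}^{\infty} \frac{(-1)^{2n-1} 2^{2n}}{2^{2n-1} (2n-1)} \binom{2n-1}{n} z^n \right) \\
  &= \frac{1}{2z} \pm \frac{1}{2z} \left( 1 + \sum_{n=1}^{\infty} \frac{-2}{2n-1} \binom{2n-1}{n} z^n \right) \\
  &= \frac{1}{2z} \pm \left( \frac{1}{2z} - \sum_{n=1}^{\infty} \frac{1}{2n-1} \binom{2n-1}{n} z^{n-1} \right) \\
  &= \frac{1}{2z} \pm \left( \frac{1}{2z} - \sum_{n=0}^{\infty} \frac{1}{2n+1} \binom{2n+1}{n+1} z^n \right) \\
  &= \sum_{n=0}^{\infty} \frac{1}{2n+1} \binom{2n+1}{n+1} z^n \\
  &= \sum_{n=0}^{\infty} \frac{1}{n+1} \binom{2n}{n} z^n.
\end{aligned}
\]

Here we choose minus for the plus-or-minus to get the right answer and
then do a little bit of tidying up of the binomial coefficient.
We can check the first few values of $f(n)$:


n    f(n)
0    $\binom{0}{0} = 1$
1    $(1/2)\binom{2}{1} = 1$
2    $(1/3)\binom{4}{2} = 6/3 = 2$
3    $(1/4)\binom{6}{3} = 20/4 = 5$


and these are consistent with what we get if we draw all the small binary
trees by hand.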
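
Drawing trees by hand gets tedious quickly, so here is a small sketch that extracts the coefficients directly from the recursive definition $F = 1 + zF^2$ and compares them with $\frac{1}{n+1}\binom{2n}{n}$. The function names are chosen for this example only.

```python
from math import comb


def tree_counts(count):
    """Coefficients f_0, ..., f_{count-1} of the F satisfying F = 1 + z*F^2."""
    f = [1]  # the empty tree
    for n in range(1, count):
        # The coefficient of z^n in z*F^2 is the coefficient of z^(n-1) in F^2,
        # which is a convolution of the earlier coefficients.
        f.append(sum(f[i] * f[n - 1 - i] for i in range(n)))
    return f


def closed_form(n):
    return comb(2 * n, n) // (n + 1)


counts = tree_counts(12)
assert counts == [closed_form(n) for n in range(12)]
print(counts[:6])  # [1, 1, 2, 5, 14, 42]
```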

The numbers $\frac{1}{n+1}\binom{2n}{n}$ show up in a lot of places in combinatorics, and
are known as the Catalan numbers.


11.3.10 Summary of operations on generating functions


The following table describes all the nasty things we can do to a generating
function. Throughout, we assume $F = \sum f_k z^k$, $G = \sum g_k z^k$, etc.


Operation      Generating functions                               Coefficients                     Combinatorial interpretation

Find $f_0$     $f_0 = F(0)$                                       Returns $f_0$                    Count weight 0 objects.
Find $f_k$     $f_k = \frac{1}{k!} \frac{d^k}{dz^k} F(z) |_{z=0}$  Returns $f_k$                    Count weight $k$ objects.
Flatten        $F(1)$                                             Computes $\sum f_k$              Count all objects, ignoring weights.
Shift right    $G = zF$                                           $g_k = f_{k-1}$                  Add 1 to all weights.
Shift left     $G = z^{-1}(F - F(0))$                             $g_k = f_{k+1}$                  Subtract 1 from all weights, after removing any weight-0 objects.
Pointing       $G = z \frac{d}{dz} F$                             $g_k = k f_k$                    A G-thing is an F-thing with a label pointing to one of its units.
Sum            $H = F + G$                                        $h_k = f_k + g_k$                Disjoint union.
Product        $H = FG$                                           $h_k = \sum_i f_i g_{k-i}$       Cartesian product.
Composition    $H = F \circ G$                                    $H = \sum f_k G^k$               To make an H-thing, first choose an F-thing of weight $m$, then bolt onto it $m$ G-things. The weight of the H-thing is the sum of the weights of the G-things.
Repetition     $G = 1/(1 - F)$                                    $G = \sum F^k$                   A G-thing is a sequence of zero or more F-things. Note: this is just a special case of composition.


11.3.11 Variants


The exponential generating function or egf for a sequence $a_0, \ldots$ is
given by $F(z) = \sum a_n z^n/n!$. For example, the egf for the sequence $1, 1, 1, \ldots$
is $e^z = \sum z^n/n!$. Exponential generating functions admit a slightly different
set of operations from ordinary generating functions: differentiation gives
left shift (since the factorials compensate for the exponents coming down),
multiplying by $z$ gives $b_n = n a_{n-1}$, etc. The main application is that
the product $F(z)G(z)$ of two egf’s gives the sequence whose $n$-th term is
$\sum_k \binom{n}{k} a_k b_{n-k}$; so for problems where we want that binomial coefficient in
the convolution (e.g. when we are building weight $n$ objects not only by
choosing a weight-$k$ object plus a weight-$(n-k)$ object but also by arbitrarily
rearranging their unit-weight pieces) we want to use an egf rather than an
ogf. We won’t use these in CS202, but it’s worth knowing they exist.
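
The difference between the two kinds of product is easy to see on coefficient lists. In this sketch (function names are illustrative), the ordinary product is a plain convolution, while the egf product inserts the binomial coefficient. Multiplying the all-ones sequence by itself gives $n + 1$ in the first case, matching $1/(1-z)^2$, and $2^n$ in the second, matching $e^z \cdot e^z = e^{2z}$.

```python
from math import comb


def ordinary_product(a, b):
    """Coefficients of the product of two ordinary generating functions."""
    n = min(len(a), len(b))
    return [sum(a[k] * b[m - k] for k in range(m + 1)) for m in range(n)]


def exponential_product(a, b):
    """Sequence whose egf is the product of the egfs of a and b."""
    n = min(len(a), len(b))
    return [sum(comb(m, k) * a[k] * b[m - k] for k in range(m + 1)) for m in range(n)]


ones = [1] * 6
print(ordinary_product(ones, ones))     # [1, 2, 3, 4, 5, 6]
print(exponential_product(ones, ones))  # [1, 2, 4, 8, 16, 32]
```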

A probability generating function or pgf is essentially an ordinary 
generating function where each coefficient a, is the probability that some 
random variable equals n. See §12.2 for more details. 


11.3.12 Further reading 


Rosen [Ros12] discusses some basic facts about generating functions in §8.4. 
Graham et al. [GKP94] give a more thorough introduction. Herbert Wilf’s
book generatingfunctionology, which can be downloaded from the web, will 
tell you more about the subject than you probably want to know. 


Chapter 12 


Probability theory 


Here are two examples of questions we might ask about the likelihood of 
some event: 


• Gambling: If I throw two six-sided dice, what are my chances of seeing a
7?


• Insurance: I insure a typical resident of Smurfington-upon-Tyne against
premature baldness. How likely is it that I have to pay a claim?


Answers to these questions are summarized by a probability, a number 
in the range 0 to 1 that represents the likelihood that some event occurs. 
There are two dominant interpretations of this likelihood: 


• The frequentist interpretation says that if an event occurs with
probability $p$, then in the limit as I accumulate many examples of
similar events, I will see the number of occurrences divided by the
number of samples converging to $p$. For example, if I flip a fair coin
over and over again many times, I expect that heads will come up
roughly half of the times I flip it, because the probability of coming up
heads is 1/2.


• The Bayesian interpretation says that when I say that an event
occurs with probability $p$, that means my subjective beliefs about the
event would lead me to take a bet that would be profitable on average
if this were the real probability. So a Bayesian would take a double-or-nothing
bet on a coin coming up heads if they believed that the
probability it came up heads was at least 1/2.


Frequentists and Bayesians have historically spent a lot of time arguing
with each other over which interpretation makes sense. The usual argument
against frequentist probability is that it only works for repeatable experiments,
and doesn’t allow for statements like “the probability that it will rain
tomorrow is 50%” or the even more problematic “based on what I know,
there is a 50% probability that it rained yesterday.” The usual argument
against Bayesian probability is that it’s hopelessly subjective: it’s possible
(even likely) that my subjective guesses about the probability that it will
rain tomorrow are not the same as yours.¹

As mathematicians, we can ignore such arguments, and treat probability 
axiomatically as just another form of counting, where we normalize every- 
thing so that we always end up counting to exactly 1. It happens to be the 
case that this approach to probability works for both frequentist interpre- 
tations (assuming that the probability of an event measures the proportion 
of outcomes that cause the event to occur) and Bayesian interpretations 
(assuming our subjective beliefs are consistent). 


12.1 Events and probabilities 


We'll start by describing the basic ideas of probability in terms of probabilities 
of events, which either occur or don’t. Later we will generalize these ideas 
and talk about random variables, which may take on many different values 
in different outcomes. 


12.1.1 Probability axioms 


Coming up with axioms for probabilities that work in all the cases we want
to consider took much longer than anybody expected, and the current set in
common use goes back only to the 1930s. Before presenting these, let’s talk a
bit about the basic ideas of probability.

An event $A$ is something that might happen, or might not; it acts like a
predicate over possible outcomes. The probability $\Pr[A]$ of an event $A$ is
a real number in the range 0 to 1 that must satisfy certain consistency rules
like $\Pr[\lnot A] = 1 - \Pr[A]$.

¹ This caricature of the debate over interpreting probability is thoroughly incomplete.
For a thoroughly complete discussion, including many other interpretations, see
http://plato.stanford.edu/entries/probability-interpret/.

In discrete probability, there is a finite set of atoms, each with an
assigned probability, and every event is a union of atoms. The probability
assigned to an event is the sum of the probabilities assigned to the atoms
it contains. For example, we could consider rolling two six-sided dice. The
atoms are the pairs $(i, j)$ that give the value on the first and second die, and
we assign a probability of 1/36 to each pair. The probability that we roll
a 7 is the sum of the cases (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1), or
6/36 = 1/6.

Discrete probability doesn’t work if we have infinitely many atoms. Sup- 
pose we roll a pair of dice infinitely many times (e.g., because we want to 
know the probability that we never accumulate more 6’s than 7’s in this 
infinite sequence). Now there are infinitely many possible outcomes: all the
sequences of pairs $(i, j)$. If we make all these outcomes equally likely, we
have to assign each a probability of zero. But then how do we get back to a 
probability of 1/6 that the first roll comes up 7? 


12.1.1.1 The Kolmogorov axioms 


A triple $(\Omega, \mathcal{F}, P)$ is a probability space if $\Omega$ is a set of outcomes (where
each outcome specifies everything that ever happens, in complete detail); $\mathcal{F}$
is a sigma-algebra, which is a family of subsets of $\Omega$, called measurable
sets, that is closed under complement (i.e., if $A$ is in $\mathcal{F}$ then $\Omega \setminus A$ is in $\mathcal{F}$)
and countable union (the union of $A_1, A_2, \ldots$ is in $\mathcal{F}$ if each set $A_i$ is); and $P$
is a probability measure that assigns a number in $[0, 1]$ to each set in $\mathcal{F}$.
The measure $P$ must satisfy three axioms, due to Kolmogorov [Kol33]:


1. $P(A) \ge 0$ for all $A \in \mathcal{F}$.

2. $P(\Omega) = 1$.

3. For any sequence of pairwise disjoint events $A_1, A_2, A_3, \ldots$, $P(\bigcup A_i) = \sum P(A_i)$.


From these one can derive rules like $P(\Omega \setminus A) = 1 - P(A)$ etc.

Most of the time, $\Omega$ is finite, and we can just make $\mathcal{F}$ include all subsets
of $\Omega$, and define $P(A)$ to be the sum of $P(\{x\})$ over all $x$ in $A$. This gets us
back to the discrete probability model we had before.

Unless we are looking at multiple probability spaces or have some particular
need to examine $\Omega$, $\mathcal{F}$, or $P$ closely, we usually won’t bother specifying
the details of the probability space we are working in. So most of the time we
will just refer to “the” probability $\Pr[A]$ of an event $A$, bearing in mind that
we are implicitly treating $A$ as a subset of some implicit $\Omega$ that is measurable
with respect to an implicit $\mathcal{F}$ and whose probability is really $P(A)$ for some
implicit measure $P$.


12.1.1.2 Examples of probability spaces


• $\Omega = \{H, T\}$, $\mathcal{F} = \mathcal{P}(\Omega) = \{\{\}, \{H\}, \{T\}, \{H, T\}\}$, $\Pr[A] = |A|/2$. This
represents a fair coin with two outcomes H and T that each occur with
probability 1/2.


• $\Omega = \{H, T\}$, $\mathcal{F} = \mathcal{P}(\Omega)$, $\Pr[\{H\}] = p$, $\Pr[\{T\}] = 1 - p$. This represents
a biased coin, where H comes up with probability $p$.


• $\Omega = \{(i, j) \mid i, j \in \{1, 2, 3, 4, 5, 6\}\}$, $\mathcal{F} = \mathcal{P}(\Omega)$, $\Pr[A] = |A|/36$. Roll of
two fair dice. A typical event might be “the total roll is 4”, which is
the set $\{(1,3), (2,2), (3,1)\}$ with probability 3/36 = 1/12.


• $\Omega = \mathbb{N}$, $\mathcal{F} = \mathcal{P}(\Omega)$, $\Pr[A] = \sum_{n \in A} 2^{-n-1}$. This is an infinite probability
space; a real-world process that might generate it is to flip a fair coin
repeatedly and count how many times it comes up tails before the first
time it comes up heads. Note that even though it is infinite, we can
still define all probabilities by summing over atoms: $\Pr[\{0\}] = 1/2$,
$\Pr[\{1\}] = 1/4$, $\Pr[\{0, 2, 4, \ldots\}] = 1/2 + 1/8 + 1/32 + \cdots = 2/3$, etc.
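
These spaces are small enough to write down explicitly. The sketch below represents a discrete space as a dictionary from atoms to probabilities and an event as a predicate on outcomes; this representation and the helper names are choices made for the illustration. For the infinite space only finite partial sums can be computed, so the 2/3 is approached from below rather than reached.

```python
from fractions import Fraction
from itertools import product


def uniform_space(outcomes):
    """Assign probability 1/N to each of N outcomes."""
    outcomes = list(outcomes)
    return {w: Fraction(1, len(outcomes)) for w in outcomes}


def pr(space, event):
    """Sum the atom probabilities of the outcomes that satisfy `event`."""
    return sum((p for w, p in space.items() if event(w)), Fraction(0))


coin = uniform_space("HT")
dice = uniform_space(product(range(1, 7), repeat=2))

assert pr(coin, lambda w: w == "H") == Fraction(1, 2)
assert pr(dice, lambda w: sum(w) == 4) == Fraction(1, 12)
assert pr(dice, lambda w: sum(w) == 7) == Fraction(1, 6)


def geometric_even_prefix(limit):
    """Partial sum of Pr[{n}] = 2^(-n-1) over even n below `limit`."""
    return sum(Fraction(1, 2 ** (n + 1)) for n in range(0, limit, 2))


gap = Fraction(2, 3) - geometric_even_prefix(40)
assert gap == Fraction(2, 3) / 4 ** 20  # the tail that has not been summed yet
```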


It’s unusual for anybody doing probability to actually write out the 
details of the probability space like this. Much more often, a writer will just 
assert the probabilities of a few basic events (e.g. $\Pr[\{H\}] = 1/2$), and claim
that any other probability that can be deduced from these initial probabilities
using the axioms also holds (e.g. $\Pr[\{T\}] = 1 - \Pr[\{H\}] = 1/2$). The main
reason Kolmogorov gets his name attached to the axioms is that he was 
responsible for Kolmogorov’s extension theorem, which says (speaking 
very informally) that as long as your initial assertions are consistent, there 
exists a probability space that makes them and all their consequences true. 


12.1.2 Probability as counting 


The easiest probability space to work with is a uniform discrete probabil- 
ity space, which has N outcomes each of which occurs with probability 1/N. 
If someone announces that some quantity is “random” without specifying 
probabilities (especially if that someone is a computer scientist), the odds 
are that what they mean is that each possible value of the quantity is equally 
likely. If that someone is being more careful, they would say that the quantity 
is “drawn uniformly at random” from a particular set. 

Such spaces are among the oldest studied in probability, and go back
to the very early days of probability theory where randomness was almost
always expressed in terms of pulling tokens out of well-mixed urns, because
such “urn models” were one of the few situations where everybody agreed
on what the probabilities should be.


12.1.2.1 Examples 


• A random bit has two outcomes, 0 and 1. Each occurs with probability
1/2.


• A die roll has six outcomes, 1 through 6. Each occurs with probability
1/6.


• A roll of two dice has 36 outcomes (order of the dice matters). Each
occurs with probability 1/36.


• A random $n$-bit string has $2^n$ outcomes. Each occurs with probability
$2^{-n}$. The probability that exactly one bit is a 1 is obtained by counting
all strings with a single 1 and dividing by $2^n$. This gives $n 2^{-n}$.


• A poker hand consists of a subset of 5 cards drawn uniformly at
random from a deck of 52 cards. Depending on whether the order of
the 5 cards is considered important (usually it isn’t), there are either
$\binom{52}{5}$ or $(52)_5$ possible hands. The probability of getting a flush (all five
cards in the hand drawn from the same suit of 13 cards) is $4\binom{13}{5} / \binom{52}{5}$;
there are 4 choices of suits, and $\binom{13}{5}$ ways to draw 5 cards from each
suit.


• A random permutation on $n$ items has $n!$ outcomes, one for each
possible permutation. A typical event might be that the first element
of a random permutation of $1 \ldots n$ is 1; this occurs with probability
$(n-1)!/n! = 1/n$. Another example of a random permutation might be
a uniform shuffling of a 52-card deck (difficult to achieve in practice!).
Here, the probability that we get a particular set of 5 cards as the first
5 in the deck is obtained by counting all the permutations that have
those 5 cards in the first 5 positions (there are $5! \cdot 47!$ of them) divided
by $52!$. The result is the same $1/\binom{52}{5}$ that we get from the uniform
poker hands.
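
Because these are all counting arguments, they translate directly into a few lines of Python. The following is a sketch with made-up names: it brute-forces the single-1 bit-string count for a small $n$ and evaluates the flush and permutation examples with exact fractions.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial


def exactly_one_bit(n):
    """Probability that a uniform n-bit string has exactly one 1, by enumeration."""
    strings = list(product((0, 1), repeat=n))
    hits = sum(1 for s in strings if sum(s) == 1)
    return Fraction(hits, len(strings))


assert exactly_one_bit(5) == Fraction(5, 2 ** 5)  # n * 2^(-n)

# Flush: 4 suits, C(13, 5) hands within a suit, out of C(52, 5) hands.
flush = Fraction(4 * comb(13, 5), comb(52, 5))

# A particular 5-card set on top of a uniformly shuffled deck.
first_five = Fraction(factorial(5) * factorial(47), factorial(52))
assert first_five == Fraction(1, comb(52, 5))

print(flush, float(flush))  # 33/16660, a little under 0.2%
```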


12.1.3 Independence and the intersection of two events 


Events $A$ and $B$ are independent if $\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$. In general,
a set of events $\{A_i\}$ is independent if each $A_i$ is independent of any event
defined only in terms of the other events.

It can be dangerous to assume that events are independent when they
aren’t, but quite often when describing a probability space we will explicitly
state that certain events are independent. For example, one typically
describes the space of random $n$-bit strings (or $n$ coin flips) by saying that
one has $n$ independent random bits and then deriving that each particular
sequence occurs with probability $2^{-n}$, rather than starting with each sequence
occurring with probability $2^{-n}$ and then calculating that each particular bit
is 1 with independent probability 1/2. The first description makes much
more of the structure of the probability space explicit, and so is more directly
useful in calculation.


12.1.3.1 Examples 


• What is the probability of getting two heads on independent fair
coin flips? Calculate it directly from the definition of independence:
$\Pr[H_1 \cap H_2] = (1/2)(1/2) = 1/4$.


• Suppose the coin-flips are not independent (maybe the two coins are
glued together). What is the probability of getting two heads? This
can range anywhere from zero (coin 2 always comes up the opposite of
coin 1) to 1/2 (if coin 1 comes up heads, so does coin 2).


• What is the probability that both you and I draw a flush (all 5 cards
the same suit) from the same poker deck? Since we are fighting over
the same collection of same-suit subsets, we’d expect $\Pr[A \cap B] \ne \Pr[A] \cdot \Pr[B]$:
the event that you get a flush ($A$) is not independent
of the event that I get a flush ($B$), and we’d have to calculate the
probability of both by counting all ways to draw two hands that are
both flushes. But if we put your cards back and then shuffle the deck
again, the events in this new case are independent, and we can just
square the $\Pr[\text{flush}]$ that we calculated before.


• Suppose the Red Sox play the Yankees. What is the probability that
the final score is exactly 4-4? Amazingly, it appears that it is equal to²

\[
\Pr[\text{Red Sox score 4 runs against the Yankees}] \cdot \Pr[\text{Yankees score 4 runs against the Red Sox}].
\]

To the extent we can measure the underlying probability distribution,
the score of each team in a professional baseball game appears to be
independent of the score of the other team.


² See http://arXiv.org/abs/math/0509698.


12.1.4 Union of events


What is the probability of $A \cup B$? If $A$ and $B$ are disjoint, then the axioms
give $\Pr[A \cup B] = \Pr[A] + \Pr[B]$. But what if $A$ and $B$ are not disjoint?

By analogy to inclusion-exclusion in counting we would expect that

\[
\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B].
\]

Intuitively, when we sum the probabilities of $A$ and $B$, we double-count
the event that both occur, and must subtract it off to compensate. To prove
this formally, consider the events $A \cap B$, $A \cap \lnot B$, and $\lnot A \cap B$. These are
disjoint, so the probability of the union of any subset of this set of events is
equal to the sum of its components. So in particular we have

\[
\begin{aligned}
\Pr[A] + \Pr[B] - \Pr[A \cap B]
  &= (\Pr[A \cap B] + \Pr[A \cap \lnot B]) + (\Pr[A \cap B] + \Pr[\lnot A \cap B]) - \Pr[A \cap B] \\
  &= \Pr[A \cap B] + \Pr[A \cap \lnot B] + \Pr[\lnot A \cap B] \\
  &= \Pr[A \cup B].
\end{aligned}
\]


12.1.4.1 Examples


• What is the probability of getting at least one head out of two independent
coin-flips? Compute $\Pr[H_1 \cup H_2] = 1/2 + 1/2 - (1/2)(1/2) = 3/4$.


• What is the probability of getting at least one head out of two coin-flips,
when the coin-flips are not independent? Here again we can get a whole
range of answers, because the probability of getting at least one
head is just $1 - \Pr[T_1 \cap T_2]$: anything from 1/2 to 1 if each coin is
individually fair, and anything from 0 to 1 if the coins need not be fair.


For more events, we can use a probabilistic version of the inclusion-exclusion
formula (Theorem 11.2.2). The new version looks like this:


Theorem 12.1.1. Let $A_1, \ldots, A_n$ be events on some probability space. Then

\[
\Pr\left[ \bigcup_{i=1}^{n} A_i \right] = \sum_{S \subseteq \{1 \ldots n\},\, S \ne \emptyset} (-1)^{|S|+1} \Pr\left[ \bigcap_{j \in S} A_j \right]. \tag{12.1.1}
\]


For discrete probability, the proof is essentially the same as for Theorem 11.2.2;
the difference is that instead of showing that we add 1 for
each possible element of $\bigcup A_i$, we show that we add the probability of each
outcome in $\bigcup A_i$. The result continues to hold for more general spaces, but
requires a little more work.³
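
For a discrete space the formula is easy to test by brute force. The sketch below works over the two-dice space with events represented as sets of outcomes; the three events chosen here are arbitrary examples, and the function names are not from the notes.

```python
from fractions import Fraction
from itertools import combinations, product

OMEGA = set(product(range(1, 7), repeat=2))


def pr(event):
    """Probability of a set of outcomes in the uniform two-dice space."""
    return Fraction(len(event), len(OMEGA))


def union_by_inclusion_exclusion(events):
    """Right-hand side of (12.1.1): signed sum over nonempty subsets S."""
    total = Fraction(0)
    for size in range(1, len(events) + 1):
        sign = (-1) ** (size + 1)
        for subset in combinations(events, size):
            total += sign * pr(set.intersection(*subset))
    return total


first_is_six = {w for w in OMEGA if w[0] == 6}
second_is_six = {w for w in OMEGA if w[1] == 6}
total_is_seven = {w for w in OMEGA if sum(w) == 7}
events = [first_is_six, second_is_six, total_is_seven]

assert union_by_inclusion_exclusion(events) == pr(set.union(*events))
```

The number of terms doubles with each added event, so this is a check on the formula rather than an efficient way to compute a union.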


³ The basic idea is to chop $\bigcup A_i$ into all sets of the form $\bigcap B_i$ where each $B_i$ is either
$A_i$ or $\lnot A_i$; this reduces to the discrete case.


12.1.5 Conditional probability


Suppose I want to answer the question “What is the probability that my dice
add up to 6 if I know that the first one is an odd number?” This question
involves conditional probability, where we calculate a probability subject
to some conditions. The probability of an event $A$ conditioned on an event
$B$, written $\Pr[A \mid B]$, is defined by the formula

\[
\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}.
\]

One way to think about this is that when we assert that $B$ occurs we are
in effect replacing the entire probability space with just the part that sits in
$B$. So we have to divide all of our probabilities by $\Pr[B]$ in order to make
$\Pr[B \mid B] = 1$, and we have to replace $A$ with $A \cap B$ to exclude the part of
$A$ that can’t happen any more.

Note also that conditioning on $B$ only makes sense if $\Pr[B] > 0$. If
$\Pr[B] = 0$, $\Pr[A \mid B]$ is undefined.
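
The opening question about the dice can be answered by applying the definition directly. This is a minimal sketch over the 36-outcome space with events as predicates; the names `pr` and `conditional` are illustrative.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))


def pr(event):
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))


def conditional(a, b):
    """Pr[A | B] = Pr[A and B] / Pr[B]; undefined when Pr[B] = 0."""
    pb = pr(b)
    if pb == 0:
        raise ValueError("conditioning on an event of probability 0")
    return pr(lambda w: a(w) and b(w)) / pb


def adds_to_six(w):
    return w[0] + w[1] == 6


def first_is_odd(w):
    return w[0] % 2 == 1


print(pr(adds_to_six))                         # 5/36
print(conditional(adds_to_six, first_is_odd))  # 1/6
```

Knowing that the first die is odd rules out (2,4) and (4,2) but also halves the space, so the probability rises slightly from 5/36 to 6/36.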


12.1.5.1 Conditional probabilities and intersections of non-independent 
events 


Simple algebraic manipulation gives

\[
\Pr[A \cap B] = \Pr[A \mid B] \cdot \Pr[B].
\]

So one of the ways to compute the probability of two events occurring
is to compute the probability of one of them, and then multiply by the
probability that the second occurs conditioned on the first. For example,
if my attempt to reach the summit of Mount Everest requires that I first
learn how to climb mountains ($\Pr[B] = 0.1$) and then make it to the
top safely ($\Pr[A \mid B] = 0.9$), then my chances of getting to the top are
$\Pr[A \cap B] = \Pr[A \mid B] \cdot \Pr[B] = (0.9)(0.1) = 0.09$.

We can do this for sequences of events as well. Suppose that I have an
urn that starts with $k$ black balls and 1 red ball. In each of $n$ trials I draw
one ball uniformly at random from the urn. If it is red, I give up. If it is
black, I put the ball back and add another black ball, thus increasing the
number of balls by 1. What is the probability that on every trial I get a
black ball?

Let $A_i$ be the event that I get a black ball in each of the first $i$ trials.
Then $\Pr[A_0] = 1$, and for larger $i$ we have $\Pr[A_i] = \Pr[A_i \mid A_{i-1}] \Pr[A_{i-1}]$.
If $A_{i-1}$ holds, then at the time of the $i$-th trial we have $k + i$ total balls in
the urn, of which one is red. So the probability that we draw a black ball is
$1 - \frac{1}{k+i} = \frac{k+i-1}{k+i}$. By induction we can then show that

\[
\Pr[A_i] = \prod_{j=1}^{i} \frac{k+j-1}{k+j}.
\]

This is an example of a collapsing product, where the denominator of each
fraction cancels out the numerator of the next; we are left only with the
denominator $k + i$ of the last term and the numerator $k$ of the first, giving
$\Pr[A_i] = \frac{k}{k+i}$. It follows that we make it through all $n$ trials with probability
$\Pr[A_n] = \frac{k}{k+n}$.
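
The collapsing product can be confirmed by multiplying the conditional probabilities out exactly. A short sketch, with a function name invented for the example:

```python
from fractions import Fraction


def all_black_probability(k, n):
    """Multiply Pr[A_j | A_{j-1}] = (k + j - 1)/(k + j) for j = 1, ..., n."""
    p = Fraction(1)  # Pr[A_0] = 1
    for j in range(1, n + 1):
        p *= Fraction(k + j - 1, k + j)
    return p


# The product should collapse to k/(k + n) for every k >= 1 and n >= 0.
for k in range(1, 6):
    for n in range(0, 10):
        assert all_black_probability(k, n) == Fraction(k, k + n)
```

With $k = 1$, for example, the chance of surviving $n$ trials is $1/(n+1)$, which decays much more slowly than the $2^{-n}$ we would get if the urn were never refilled.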


12.1.5.2 The law of total probability 


We can use the fact that $A$ is the disjoint union of $A \cap B$ and $A \cap \lnot B$ to get
$\Pr[A]$ by case analysis:

\[
\begin{aligned}
\Pr[A] &= \Pr[A \cap B] + \Pr[A \cap \lnot B] \\
       &= \Pr[A \mid B] \Pr[B] + \Pr[A \mid \lnot B] \Pr[\lnot B].
\end{aligned}
\]

For example, if there is a 0.2 chance I can make it to the top of Mt
Everest safely without learning how to climb first, my chances of getting
there go up to $(0.9)(0.1) + (0.2)(0.9) = 0.27$.

This method is sometimes given the rather grandiose name of the law
of total probability. The most general version is that if $B_1, \ldots, B_n$ are all
disjoint events and the sum of their probabilities is 1, then

\[
\Pr[A] = \sum_{i=1}^{n} \Pr[A \mid B_i] \Pr[B_i].
\]


12.1.5.3  Bayes’s formula 
If one knows $\Pr[A \mid B]$, $\Pr[A \mid \lnot B]$, and $\Pr[B]$, it’s possible to compute
$\Pr[B \mid A]$:

\[
\begin{aligned}
\Pr[B \mid A] &= \frac{\Pr[A \cap B]}{\Pr[A]} \\
  &= \frac{\Pr[A \mid B] \Pr[B]}{\Pr[A]} \\
  &= \frac{\Pr[A \mid B] \Pr[B]}{\Pr[A \mid B] \Pr[B] + \Pr[A \mid \lnot B] \Pr[\lnot B]}.
\end{aligned}
\]

This formula is used heavily in statistics, where it goes by the name of
Bayes’s formula. Say that you have an Airport Terrorist Detector that
lights up with probability 0.75 when inserted into the nostrils of a Terrorist,
but lights up with probability 0.001 when inserted into the nostrils of a
non-Terrorist. Suppose that for other reasons you know that Granny has
only a 0.0001 chance of being a Terrorist. What is the probability that
Granny is a Terrorist if the detector lights up?

Let $B$ be the event “Granny is a terrorist” and $A$ the event “Detector
lights up.” Then $\Pr[B \mid A] = (0.75 \times 0.0001)/(0.75 \times 0.0001 + 0.001 \times 0.9999) \approx 0.0698$.
So even after the detector lights up, Granny is still about 93% likely to be innocent.
This example shows how even a small false positive rate can
make it difficult to interpret the results of tests for rare conditions.
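
Both the law of total probability and Bayes’s formula are one-liners once the three input probabilities are known. The sketch below uses exact fractions and the numbers from the Everest and Granny examples; the function and variable names are made up for this illustration.

```python
from fractions import Fraction


def total_probability(p_a_given_b, p_a_given_not_b, p_b):
    """Pr[A] = Pr[A | B] Pr[B] + Pr[A | not B] Pr[not B]."""
    return p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)


def bayes(p_a_given_b, p_a_given_not_b, p_b):
    """Pr[B | A], with the denominator expanded by the law of total probability."""
    return p_a_given_b * p_b / total_probability(p_a_given_b, p_a_given_not_b, p_b)


# Everest: Pr[B] = 0.1, Pr[A | B] = 0.9, Pr[A | not B] = 0.2.
everest = total_probability(Fraction(9, 10), Fraction(2, 10), Fraction(1, 10))

# Granny: detection rate 0.75, false positive rate 0.001, prior 0.0001.
granny = bayes(Fraction(75, 100), Fraction(1, 1000), Fraction(1, 10000))

print(everest, float(granny))
```

With these inputs the posterior works out to 250/3583, a little under 7%.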


12.2 Random variables 


A random variable $X$ is a variable that takes on particular values randomly.
This means that for each possible value $x$, there is an event $[X = x]$ with
some probability of occurring that corresponds to $X$ (the random variable,
usually written as an upper-case letter) taking on the value $x$ (some fixed
value). Formally, a random variable $X$ is really a function $X(\omega)$ of the
outcome $\omega$ that occurs, but we save a lot of ink by leaving out $\omega$.⁴


12.2.1 Examples of random variables 


e Indicator variables: The indicator variable for an event A is a vari- 
able X that is 1 if A occurs and 0 if it doesn’t (ie., X(w) =lifweA 
and 0 otherwise). There are many conventions out there for writing 
indicator variables. I am partial to 14, but you may also see them 
written using the Greek letter chi (e.g. y4) or by abusing the bracket 
notation for events (e.g., [A], [Y? > 3], [all six coins come up heads}). 


e Functions of random variables: Any function you are likely to run 
across of a random variable or random variables is a random variable. 
If X and Y are random variables, X +Y, XY, and log X are all random 
variables. 


e Counts of events: Flip a fair coin n times and let X be the number of 
times it comes up heads. Then X is an integer-valued random variable. 


⁴For some spaces, not all functions X(ω) work as random variables, because the events
[X = x] might not be measurable with respect to F. We will generally not run into these
issues.
• Random sets and structures: Suppose that we have a set T of n
elements, and we pick out a subset U by flipping an independent fair
coin for each element to decide whether to include it. Then U is a
set-valued random variable. Or we could consider the infinite sequence
X_0, X_1, X_2, ..., where X_0 = 0 and X_{n+1} is either X_n + 1 or X_n − 1,
depending on the result of an independent fair coin flip. Then we can
think of the entire sequence X as a sequence-valued random variable.
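To make the "a random variable is a function of the outcome" view concrete, here is a small Python sketch over the sample space of three fair coin-flips. The names `omega`, `count_heads`, `all_heads`, and `prob` are hypothetical helpers of mine; the point is only that Pr[X = x] is the total probability of the outcomes ω with X(ω) = x.

```python
from fractions import Fraction
from itertools import product

n = 3
# Sample space: all sequences of n fair coin-flips (1 = heads), equally likely.
omega = list(product([0, 1], repeat=n))
pr = {w: Fraction(1, len(omega)) for w in omega}

def count_heads(w):
    """A "count of events" random variable: number of heads in outcome w."""
    return sum(w)

def all_heads(w):
    """Indicator variable for the event "every flip came up heads"."""
    return 1 if all(w) else 0

def prob(X, x):
    """Pr[X = x]: add up the probabilities of the outcomes w with X(w) = x."""
    return sum(pr[w] for w in omega if X(w) == x)

print([prob(count_heads, k) for k in range(n + 1)])
print(prob(all_heads, 1))
```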


12.2.2 The distribution of a random variable 


The distribution of a random variable describes the probability that it
takes on various values. For real-valued random variables, the distribution
function or cumulative distribution function is a function F(x) =
Pr[X ≤ x]. This allows for very general distributions (for example, a variable
that is uniform on [0, 1] can be specified by F(x) = x when 0 ≤ x ≤ 1, and 0 or
1 as appropriate outside this interval), but for discrete random variables
that take on only countably many possible values, this is usually more power
than we need.

For discrete variables, the distribution is most easily described by just 
giving the probability mass function Pr[X = x] for each possible value x.
If we need to, it’s not too hard to recover the distribution function from the 
mass function (or vice versa). So we will often cheat a bit and treat a mass 
function as specifying a distribution even if it isn’t technically a distribution 
function. 
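As a quick illustration of going back and forth between the two descriptions, here is a sketch in Python for a fair six-sided die. The helper names `cdf` and `pmf_from_cdf` are mine, and the second one assumes a discrete variable with a known, sorted list of possible values.

```python
from fractions import Fraction

# Mass function of a fair six-sided die: Pr[X = x] for each possible value x.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(pmf, x):
    """Distribution function F(x) = Pr[X <= x], computed from the mass function."""
    return sum(p for v, p in pmf.items() if v <= x)

def pmf_from_cdf(F, values):
    """Recover Pr[X = v] as the jump of F at each value v (values sorted)."""
    out, previous = {}, Fraction(0)
    for v in values:
        out[v] = F(v) - previous
        previous = F(v)
    return out

recovered = pmf_from_cdf(lambda x: cdf(pmf, x), sorted(pmf))
print(cdf(pmf, 4), recovered == pmf)
```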

Typically, if we know the distribution of a random variable, we don’t 
bother worrying about what the underlying probability space is. The reason 
for this is we can just take Ω to be the range of the random variable, and
define Pr[ω] for each ω in Ω to be Pr[X = ω]. For example, a six-sided
die corresponds to taking Ω = {1,2,3,4,5,6}, assigning Pr[ω] = 1/6 for all
ω, and letting X(ω) = ω. This will give the probabilities for any events
involving X that we would have gotten on whatever original probability 
space X might have been defined on. 

The same thing works if we have multiple random variables, but now 
we let each point in the space be a tuple that gives the values of all of 
the variables. Specifying the probability in this case is done using a joint 
distribution (see below). 


12.2.2.1 Some standard distributions 


Here are some common distributions for a random variable X: 
• Bernoulli distribution: Pr[X = 1] = p, Pr[X = 0] = q, where p is
a parameter of the distribution and q = 1 − p. This corresponds to a
single biased coin-flip.


• Binomial distribution: Pr[X = k] = (n choose k) p^k q^(n−k), where n and p are
parameters of the distribution and q = 1 − p. This corresponds to the
sum of n biased coin-flips.


• Geometric distribution: Pr[X = k] = q^k p, where p is a parameter
of the distribution and q is again equal to 1 − p. This corresponds to the
number of tails we flip before we get the first head in a sequence of
biased coin-flips.


• Poisson distribution: Pr[X = k] = e^(−λ) λ^k / k!. This is what happens
to a binomial distribution when we make p = λ/n and then take the
limit as n goes to infinity. We can think of it as counting the number
of events that occur in one time unit if the events occur at a constant
continuous rate that averages λ events per time unit. The canonical
example is radioactive decay.


• Uniform distribution: For the uniform distribution on [a, b], the
distribution function F of X is given by F(x) = 0 when x ≤ a, (x −
a)/(b − a) when a ≤ x ≤ b, and 1 when b ≤ x, where a and b are
parameters of the distribution. This is a continuous random variable
that has equal probability of landing anywhere in the [a, b] interval.
The term uniform distribution may also refer to a uniform distribution
on a finite set S; this assigns Pr[X = x] = 1/|S| when x is in S and 0
otherwise. As a distribution function, F(x) is the rather discontinuous
function |{y ∈ S | y ≤ x}|/|S|.


• Normal distribution: The normal distribution function is given by

\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt. \]

This corresponds to another limit of the binomial distribution, where
now we fix p = 1/2 but compute (X − n/2)/√n to converge to a single fixed
distribution as n goes to infinity. The normal distribution shows up
(possibly scaled and shifted) whenever we have a sum of many independent,
identically distributed random variables: this is the Central
Limit Theorem, and is the reason why much of statistics works, and
why we can represent 0 and 1 bits using buckets of jumpy randomly-positioned
electrons.
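The claim that the Poisson distribution is a limit of binomial distributions with p = λ/n is easy to poke at numerically. The following Python sketch compares the two mass functions at one point; the choices λ = 2 and k = 3 and the function names are arbitrary illustrations of mine.

```python
from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    """Pr[X = k] for a binomial random variable with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    """Pr[X = k] for a Poisson random variable with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

lam, k = 2.0, 3
for n in (10, 100, 10_000):
    # As n grows with p = lam/n, the binomial value approaches the Poisson one.
    print(n, binomial_pmf(n, lam / n, k), poisson_pmf(lam, k))
```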
12.2.2.2 Joint distributions


Two or more random variables can be described using a joint distribution.
For discrete random variables, we often represent this as a joint probability
mass function Pr[X = x ∧ Y = y] for all fixed values x and y, or more
generally Pr[∀i : X_i = x_i]. For continuous random variables, we may instead
need to use a joint distribution function F(x_1, ..., x_n) = Pr[∀i : X_i ≤ x_i].

Given a joint distribution on X and Y, we can recover the distribution on
X or Y individually by summing up cases: Pr[X = x] = Σ_y Pr[X = x ∧ Y = y]
(for discrete variables), or Pr[X ≤ x] = lim_{y→∞} Pr[X ≤ x ∧ Y ≤ y] (for
more general variables). The distribution of X obtained in this way is called
a marginal distribution of the original joint distribution. In general, we
can’t go in the other direction, because just knowing the marginal
distributions doesn’t tell us how the random variables might be dependent on
each other.


Examples 


• Let X and Y be six-sided dice. Then Pr[X = x ∧ Y = y] = 1/36 for
all values of x and y in {1,2,3,4,5,6}. The underlying probability
space consists of all pairs (x, y) in {1,2,3,4,5,6} × {1,2,3,4,5,6}.


• Let X be a six-sided die and let Y = 7 − X. Then Pr[X = x ∧ Y = y] =
1/6 if 1 ≤ x ≤ 6 and y = 7 − x, and 0 otherwise. The underlying
probability space is most easily described by including just six points for
the X values, although we could also do {1,2,3,4,5,6} × {1,2,3,4,5,6}
as in the previous case, just assigning probability 0 to most of the
points. However, even though the joint distribution is very different
from the previous case, the marginal distributions of X and Y are
exactly the same as before: each of X and Y takes on all values in
{1,2,3,4,5,6} with equal probability.
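The two examples above can be checked mechanically. This Python sketch stores each joint mass function as a dictionary and sums out the other coordinate to get a marginal; the names `two_dice`, `mirrored`, and `marginal` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

faces = range(1, 7)
# Joint mass functions Pr[X = x and Y = y] for the two examples.
two_dice = {(x, y): Fraction(1, 36) for x in faces for y in faces}
mirrored = {(x, 7 - x): Fraction(1, 6) for x in faces}   # Y = 7 - X

def marginal(joint, index):
    """Marginal mass function of coordinate `index`: sum over the other one."""
    out = defaultdict(Fraction)
    for point, p in joint.items():
        out[point[index]] += p
    return dict(out)

# Different joint distributions, identical marginals.
print(marginal(two_dice, 0) == marginal(mirrored, 0))
print(marginal(two_dice, 1) == marginal(mirrored, 1))
```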


12.2.3 Independence of random variables


The difference between the two preceding examples is that in the first case, 
X and Y are independent, and in the second case, they aren’t. 

Two random variables X and Y are independent if any pair of events
of the form X ∈ A, Y ∈ B are independent. For discrete random variables,
it is enough to show that Pr[X = x ∧ Y = y] = Pr[X = x] · Pr[Y = y], or
in other words that the events [X = x] and [Y = y] are independent for all
values x and y. For continuous random variables, the corresponding equation
is Pr[X ≤ x ∧ Y ≤ y] = Pr[X ≤ x] · Pr[Y ≤ y]. In practice, we will typically
either be told that two random variables are independent or deduce it from
the fact that they arise from separate physical processes.


12.2.3.1 Examples 


• Roll two six-sided dice, and let X and Y be the values of the dice. By
convention we assume that these values are independent. This means
for example that Pr[X ∈ {1,2,3} ∧ Y ∈ {1,2,3}] = Pr[X ∈ {1,2,3}] ·
Pr[Y ∈ {1,2,3}] = (1/2)(1/2) = 1/4, which is a slightly easier
computation than counting up the 9 cases (and then arguing that each
occurs with probability (1/6)², which requires knowing that X and Y
are independent).


• Take the same X and Y, and let Z = X + Y. Now Z and X are not
independent, because Pr[X = 1 ∧ Z = 12] = 0, which is not equal to
Pr[X = 1] · Pr[Z = 12] = (1/6)(1/36) = 1/216.


• Place two radioactive sources on opposite sides of the Earth, and let X
and Y be the number of radioactive decay events in each source during 
some 10 millisecond interval. Since the sources are 42 milliseconds away 
from each other at the speed of light, we can assert that either X and 
Y are independent, or the world doesn’t behave the way the physicists 
think it does. This is an example of variables being independent because 
they are physically independent. 


• Roll one six-sided die X, and let Y = ⌈X/2⌉ and Z = X mod 2. Then
Y and Z are independent, even though they are generated using the
same physical process.
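Claims like these are easy to verify by brute force on small spaces. The sketch below checks the discrete independence condition Pr[U = u ∧ V = v] = Pr[U = u] · Pr[V = v] for every pair of values. The helper `independent` is my own, and I am reading the rounding in the last example as a ceiling, since that is the reading under which the claim checks out (with a floor, Y = 0 would force Z = 1).

```python
from fractions import Fraction
from itertools import product

def independent(space, U, V):
    """True if Pr[U=u and V=v] == Pr[U=u] * Pr[V=v] for all values u, v."""
    def pr(event):
        return sum(p for w, p in space.items() if event(w))
    us = {U(w) for w in space}
    vs = {V(w) for w in space}
    return all(
        pr(lambda w: U(w) == u and V(w) == v)
        == pr(lambda w: U(w) == u) * pr(lambda w: V(w) == v)
        for u, v in product(us, vs)
    )

die = {x: Fraction(1, 6) for x in range(1, 7)}
two_dice = {(x, y): Fraction(1, 36) for x in die for y in die}

# Y = ceiling(X/2), written as (x + 1) // 2, and Z = X mod 2.
print(independent(die, lambda x: (x + 1) // 2, lambda x: x % 2))
# X and Z = X + Y for two dice.
print(independent(two_dice, lambda w: w[0], lambda w: w[0] + w[1]))
```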


12.2.3.2 Independence of many random variables 


In general, if we have a collection of random variables X_i, we say that
they are all independent if the joint distribution is the product of the
marginal distributions, i.e., if Pr[∀i : X_i ≤ x_i] = ∏_i Pr[X_i ≤ x_i]. It may be
that a collection of random variables is not independent even though all
proper subcollections are.

For example, let X and Y be fair coin-flips, and let Z = X ⊕ Y. Then
any two of X, Y, and Z are independent, but the three variables X, Y, and
Z are not independent, because Pr[X = 0 ∧ Y = 0 ∧ Z = 0] = 1/4 instead
of 1/8 as one would get by taking the product of the marginal probabilities.
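Here is a short Python sketch of the exclusive-or example, enumerating the four equally likely outcomes. The variable names are mine; the code checks every pair of variables for independence and then compares the joint probability of (0, 0, 0) with the product of the marginals.

```python
from fractions import Fraction
from itertools import product

# Each outcome is a triple (x, y, z) with z = x xor y; x and y are fair flips.
space = {(x, y, x ^ y): Fraction(1, 4) for x, y in product([0, 1], repeat=2)}

def pr(pred):
    return sum(p for w, p in space.items() if pred(w))

pairwise = all(
    pr(lambda w: w[i] == a and w[j] == b)
    == pr(lambda w: w[i] == a) * pr(lambda w: w[j] == b)
    for i, j in [(0, 1), (0, 2), (1, 2)]
    for a, b in product([0, 1], repeat=2)
)
joint_000 = pr(lambda w: w == (0, 0, 0))
product_000 = (pr(lambda w: w[0] == 0) * pr(lambda w: w[1] == 0)
               * pr(lambda w: w[2] == 0))
print(pairwise, joint_000, product_000)
```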

Since we can compute the joint distribution from the marginal
distributions for independent variables, we will often just specify the marginal
distributions and declare that a collection of random variables are
independent. This implicitly gives us an underlying probability space consisting of
all sequences of values for the variables.


12.2.4 The expectation of a random variable 


For a real-valued random variable X, its expectation E[X] (sometimes just
E X) is its average value, weighted by probability.⁵ For discrete random
variables, the expectation is defined by

\[ \mathrm{E}[X] = \sum_{x} x \Pr[X = x]. \]

For a continuous random variable with distribution function F(x), the
expectation is defined by

\[ \mathrm{E}[X] = \int_{-\infty}^{\infty} x \, dF(x). \]

The integral here is a Lebesgue-Stieltjes integral, which generalizes the
usual integral for continuous F(x) by doing the right thing if F(x) jumps due
to some x that occurs with nonzero probability. We will avoid thinking about
this by mostly worrying about expectations for discrete random variables.


Example (discrete variable) Let X be the number rolled with a fair
six-sided die. Then E[X] = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 7/2.


Example (unbounded discrete variable) Let X be a geometric random
variable with parameter p. This means that Pr[X = k] = q^k p, where as
usual q = 1 − p. Then

\[ \mathrm{E}[X] = \sum_{k=0}^{\infty} k q^k p = p \sum_{k=0}^{\infty} k q^k = p \cdot \frac{q}{(1-q)^2} = \frac{pq}{p^2} = \frac{q}{p} = \frac{1-p}{p} = \frac{1}{p} - 1. \]
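Both examples are easy to sanity-check numerically. In this Python sketch the die expectation is computed exactly, and the geometric expectation is approximated by truncating the infinite sum. The cutoff of 200 terms and the choice p = 1/3 are arbitrary assumptions of mine; the truncated tail is negligible here because q^k shrinks geometrically.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum over x of x * Pr[X = x], for a finite mass function."""
    return sum(x * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}
print(expectation(die))

p = Fraction(1, 3)
q = 1 - p
# Truncated version of sum_{k>=0} k q^k p; the closed form is 1/p - 1.
partial = sum(k * q**k * p for k in range(200))
print(float(partial), float(1 / p - 1))
```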


Expectation is a way to summarize the distribution of a random variable
without giving all the details. If you take the average of many independent
copies of a random variable, you will be likely to get a value close to the
expectation. Expectations are also used in decision theory to compare
different choices. For example, given a choice between a 50% chance of
winning $100 (expected value: $50) and a 20% chance of winning $1000
(expected value: $200), a rational decision maker would take the second
option. Whether ordinary human beings correspond to an economist’s notion
of a rational decision maker often depends on other details of the situation.


⁵Technically, this will work for any values we can add and multiply by probabilities.
So if X is actually a vector in R³ (for example), we can talk about the expectation of X,
which in some sense will be the average position of the location given by X.

Terminology note: If you hear somebody say that some random variable 
X takes on the value z on average, this usually means that E[X] = z. 


12.2.4.1 Variables without expectations 


If a random variable has a particularly annoying distribution, it may not
have a finite expectation, even though the variable itself takes on only finite
values. This happens if the sum for the expectation diverges.

For example, suppose I start with a dollar, and double my money every 
time a fair coin-flip comes up heads. If the coin comes up tails, I keep 
whatever I have at that point. What is my expected wealth at the end of 
this process? 

Let X be the number of times I get heads. Then X is just a geometric
random variable with p = 1/2, so Pr[X = k] = (1 − (1/2))^k (1/2) = 2^(−k−1).
My wealth is also a random variable: 2^X. If we try to compute E[2^X], we
get

\[ \mathrm{E}[2^X] = \sum_{k=0}^{\infty} 2^k \Pr[X = k] = \sum_{k=0}^{\infty} 2^k \cdot 2^{-k-1} = \sum_{k=0}^{\infty} \frac{1}{2}, \]

which diverges. Typically we say that a random variable like this has no
expected value, although sometimes you will see people writing E[2^X] = ∞.


(For an even nastier case, consider what happens with E[(−2)^X].)


12.2.4.2 Expectation of a sum 


The expectation operator is linear: this means that E[X + Y] = E[X] + 
E[Y] and E[aX] = aE[X] when a is a constant. This fact holds for all 
random variables X and Y, whether they are independent or not, and is not 
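hard to prove for discrete probability spaces:

\begin{align*}
\mathrm{E}[aX + Y] &= \sum_{x,y} (ax + y) \Pr[X = x \wedge Y = y] \\
&= a \sum_{x,y} x \Pr[X = x \wedge Y = y] + \sum_{x,y} y \Pr[X = x \wedge Y = y] \\
&= a \sum_{x} x \sum_{y} \Pr[X = x \wedge Y = y] + \sum_{y} y \sum_{x} \Pr[X = x \wedge Y = y] \\
&= a \sum_{x} x \Pr[X = x] + \sum_{y} y \Pr[Y = y] \\
&= a \mathrm{E}[X] + \mathrm{E}[Y].
\end{align*}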


Linearity of expectation makes computing many expectations easy.
Example: Flip a fair coin n times, and let X be the number of heads. What is
E[X]? We can solve this problem by letting X_i be the indicator variable
for the event “coin i came up heads.” Then X = Σ_{i=1}^n X_i and
E[X] = E[Σ_{i=1}^n X_i] = Σ_{i=1}^n E[X_i] = Σ_{i=1}^n 1/2 = n/2. In principle it is possible
to calculate the same value from the distribution of X (this involves a lot of
binomial coefficients), but linearity of expectation is much easier.


Example Choose a random permutation π, i.e., a random bijection from
{1...n} to itself. What is the expected number of values i for which π(i) = i?

Let X_i be the indicator variable for the event that π(i) = i. Then we are
looking for E[X_1 + X_2 + ... + X_n] = E[X_1] + E[X_2] + ... + E[X_n]. But E[X_i] is
just 1/n for each i, so the sum is n(1/n) = 1. Calculating this by computing
Pr[Σ_{i=1}^n X_i = x] first would be very painful.
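For small n we can afford the painful approach and let a computer do it. This Python sketch enumerates all n! permutations and averages the number of fixed points exactly; the function name `expected_fixed_points` is mine, and the enumeration is only feasible for small n.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Average number of i with pi(i) = i over all permutations of n items."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i in range(n) if pi[i] == i) for pi in perms)
    return Fraction(total, len(perms))

# Linearity of expectation predicts exactly 1 for every n >= 1.
print([expected_fixed_points(n) for n in range(1, 7)])
```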


12.2.4.3 Expectation of a product 


For products of random variables, the situation is more complicated. Here 
the rule is that E[XY] = E[X]-E[Y] if X and Y are independent. But if 
X and Y are not independent, the expectation of their product can’t be 
computed without considering their joint distribution. 

For example: Roll two dice and take their product. What value do we
get on average? The product formula gives E[XY] = E[X] E[Y] = (7/2)² =
49/4 = 12 1/4. We could also calculate this directly by summing over all 36
cases, but it would take a while.

Alternatively, roll one die and multiply it by itself. Now what value do
we get on average? Here we are no longer dealing with independent random
variables, so we have to do it the hard way: E[X²] = (1² + 2² + 3² + 4² +
5² + 6²)/6 = 91/6 = 15 1/6. This is substantially higher than when the dice
are uncorrelated. (Exercise: How can you rig the second die so it still comes
up with each value 1/6 of the time but minimizes E[XY]?)
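Summing over all 36 cases takes a while by hand but no time at all by machine. The following Python sketch computes both averages exactly; the variable names are my own.

```python
from fractions import Fraction

faces = range(1, 7)
e_x = Fraction(sum(faces), 6)                                   # E[X] = 7/2
# Two independent dice: every pair (x, y) has probability 1/36.
e_xy_independent = sum(Fraction(x * y, 36) for x in faces for y in faces)
# One die multiplied by itself: only the pairs (x, x), each with probability 1/6.
e_x_squared = sum(Fraction(x * x, 6) for x in faces)

print(e_xy_independent, e_x**2, e_x_squared)
```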

We can prove the product rule without too much trouble for discrete 
random variables. The easiest way is to start from the right-hand side. 


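\begin{align*}
\mathrm{E}[X] \cdot \mathrm{E}[Y] &= \left(\sum_{x} x \Pr[X = x]\right) \left(\sum_{y} y \Pr[Y = y]\right) \\
&= \sum_{x,y} xy \Pr[X = x] \Pr[Y = y] \\
&= \sum_{z} z \left(\sum_{x,y,\, xy = z} \Pr[X = x] \Pr[Y = y]\right) \\
&= \sum_{z} z \left(\sum_{x,y,\, xy = z} \Pr[X = x \wedge Y = y]\right) \\
&= \sum_{z} z \Pr[XY = z] \\
&= \mathrm{E}[XY].
\end{align*}

Here we use independence in going from Pr[X = x] Pr[Y = y] to Pr[X = x ∧ Y = y]
and use the union rule to convert the x, y sum into Pr[XY = z].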


12.2.4.4 Conditional expectation 


Like conditional probability, there is also a notion of conditional
expectation. The simplest version of conditional expectation conditions on a single
event A, is written E[X | A], and is defined for discrete random variables by

\[ \mathrm{E}[X \mid A] = \sum_{x} x \Pr[X = x \mid A]. \]


This is exactly the same as ordinary expectation except that the proba- 
bilities are now all conditioned on A. 

To take a simple example, consider the expected value of a six-sided die 
conditioned on not rolling a 1. The conditional probability of getting 1 is 
now 0, and the conditional probability of each of the remaining 5 values is 
1/5, so we get (1/5)(2 + 3 + 4 + 5 + 6) = 4.

Conditional expectation acts very much like regular expectation, so for
example we have E[aX + bY | A] = aE[X | A] + bE[Y | A].

One of the most useful applications of conditional expectation is that it 
allows computing (unconditional) expectations by case analysis, using the 
fact that

\[ \mathrm{E}[X] = \mathrm{E}[X \mid A] \Pr[A] + \mathrm{E}[X \mid \neg A] \Pr[\neg A] \]

or, more generally,

\[ \mathrm{E}[X] = \sum_{i} \mathrm{E}[X \mid A_i] \Pr[A_i] \]

when A_1, A_2, ... are disjoint events whose union is the entire probability
space Ω. This is the expectation analog of the law of total probability.
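As a small check of both the definition and the case-analysis formula, here is a Python sketch using the six-sided die conditioned on not rolling a 1. The helper `cond_expectation` is a hypothetical name of mine; it takes the event A as a predicate on the value of the die.

```python
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}

def cond_expectation(pmf, event):
    """E[X | A] = sum_x x Pr[X = x | A], with A given as a predicate on x."""
    pr_a = sum(p for x, p in pmf.items() if event(x))
    return sum(x * p for x, p in pmf.items() if event(x)) / pr_a

def not_one(x):
    return x != 1

def is_one(x):
    return x == 1

print(cond_expectation(die, not_one))
# Case analysis: E[X] = E[X | A] Pr[A] + E[X | not A] Pr[not A].
total = (cond_expectation(die, not_one) * Fraction(5, 6)
         + cond_expectation(die, is_one) * Fraction(1, 6))
print(total)
```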


Examples 


• I have a 50% chance of reaching the top of Mt Everest, where Sir
Edmund Hillary and Tenzing Norgay hid somewhere between 0 and 10
kilograms of gold (a random variable with uniform distribution). How
much gold do I expect to bring home? Compute


E[X] = E[X | reached the top] Pr[reached the top] + E[X | didn’t] Pr[didn’t]
= 5 · 0.5 + 0 · 0.5 = 2.5.


• Suppose I flip a coin that comes up heads with probability p until I
get heads. How many times on average do I flip the coin?


We’ll let X be the number of coin flips. Conditioning on whether
the coin comes up heads on the first flip gives E[X] = 1 · p + (1 +
E[X′]) · (1 − p), where X′ is a random variable counting the number of
coin-flips needed to get heads ignoring the first coin-flip. But since X′
has the same distribution as X, we get E[X] = p + (1 − p)(1 + E[X]) or
E[X] = (p + (1 − p))/p = 1/p. So a fair coin must be flipped twice on average
to get a head, which is about what we’d expect if we hadn’t thought
about it much.


• Suppose I have my experimental test subjects complete a task that gets
scored on a scale of 0 to 100. I decide to test whether rewarding success
is a better strategy for improving outcomes than punishing failure. So
for any subject that scores higher than 50, I give them a chocolate bar.
For any subject that scores lower than 50, I give them an electric shock.
(Subjects who score exactly 50 get nothing.) I then have them each
perform the task a second time and measure the average change in
their scores. What happens?
Let’s suppose that there is no effect whatsoever of my rewards and
punishments, and that each test subject obtains each possible score with 
equal probability 1/101. Now let’s calculate the average improvement 
for test subjects who initially score less than 50 or greater than 50. 
Call the outcome on the first test X and the outcome on the second 
test Y. The change in the score is then Y — X. 


In the first case, we are computing E[Y − X | X < 50]. This is the same
as E[Y | X < 50] − E[X | X < 50] = E[Y] − E[X | X < 50] = 50 −
24.5 = +25.5. So punishing failure produces a 25.5 point improvement
on average.


In the second case, we are computing E[Y − X | X > 50]. This is the
same as E[Y | X > 50] − E[X | X > 50] = E[Y] − E[X | X > 50] =
50 − 75.5 = −25.5. So rewarding success produces a 25.5 point decline
on average.


Clearly this suggests that we punish failure if we want improvements 
and reward success if we want backsliding. This is intuitively correct: 
punishing failure encourages our slacker test subjects to do better next 
time, while rewarding success just makes them lazy and complacent. 
But since the test outcomes don’t depend on anything we are doing, 
we get exactly the same answer if we reward failure and punish success: 
in the former case, a +25.5 point average change, in the latter a −25.5
point average change. This is also intuitively correct: rewarding failure
makes our subjects like the test so that they will try to do better
next time, while punishing success makes them feel that it isn’t worth
it. From this we learn that our intuitions⁶ provide powerful tools for
rationalizing almost any outcome in terms of the good or bad behavior
of our test subjects. A more careful analysis shows that we performed
the wrong comparison, and we are the victim of regression to the
mean. This phenomenon was one of several now-notorious cognitive
biases described in a famous paper by Tversky and Kahneman [TK74].


For a real-world example of how similar problems can arise in processing 
data, the United States Bureau of Labor Statistics defines a small 
business as any company with 500 or fewer employees. So if a company 
has 400 employees in 2007, 600 in 2008, and 400 in 2009, then we 
just saw a net creation of 200 new jobs by a small business in 2007, 
followed by the destruction of 200 jobs by a large business in 2008. It 
has been argued that this effect accounts for much of the observed fact
that small businesses generate proportionally more new jobs than large
ones, although the details are tricky [NWZ11].


⁶OK, my intuitions.
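The numbers in the reward-and-punishment example come straight out of the case analysis, and a few lines of Python reproduce them exactly. This is only a sketch under the example's own assumptions (scores uniform on 0 to 100, second score independent of the first); the helper name `mean_first_score` is mine.

```python
from fractions import Fraction

scores = range(101)                  # each score 0..100 has probability 1/101
e_y = Fraction(sum(scores), 101)     # E[Y]; Y does not depend on X at all

def mean_first_score(condition):
    """E[X | condition(X)] for X uniform on 0..100."""
    chosen = [x for x in scores if condition(x)]
    return Fraction(sum(chosen), len(chosen))

change_low = e_y - mean_first_score(lambda x: x < 50)    # E[Y - X | X < 50]
change_high = e_y - mean_first_score(lambda x: x > 50)   # E[Y - X | X > 50]
print(float(change_low), float(change_high))
```

Nothing in the code knows about chocolate bars or electric shocks, which is the point: the apparent effect comes entirely from which subjects we chose to look at.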


12.2.4.5 Conditioning on a random variable 


There is a more general notion of conditional expectation for random variables, 
where the conditioning is done on some other random variable Y. Unlike 
E[X | A], which is a constant, the expected value of X conditioned on Y, 
written E[X | Y], is itself a random variable: when Y = y, it takes on the 
value E[X | Y = y]. 

Here’s a simple example. Let’s compute E[X + Y | X] where X and Y are
the values of independent six-sided dice. When X = x, E[X + Y | X] takes the
value E[X + Y | X = x] = x + E[Y] = x + 7/2. For the full random variable we
can write E[X + Y | X] = X + 7/2.

Another way to get the result in the preceding example is to use some 
general facts about conditional expectation: 


• E[aX + bY | Z] = aE[X | Z] + bE[Y | Z]. This is the
conditional-expectation version of linearity of expectation.


• E[X | X] = X. This is immediate from the definition, since
E[X | X = x] = x.


• If X and Y are independent, then E[Y | X] = E[Y]. The intuition
is that knowing the value of X gives no information about Y, so
E[Y | X = x] = E[Y] for any x in the range of X. (To do this formally
requires using the fact that Pr[Y = y | X = x] =
Pr[Y = y ∧ X = x]/Pr[X = x] = Pr[Y = y] Pr[X = x]/Pr[X = x] =
Pr[Y = y], provided X and Y are independent and Pr[X = x] ≠ 0.)


• Also useful: E[E[X | Y]] = E[X]. Averaging a second time removes
all dependence on Y.


These in principle allow us to do very complicated calculations involving 
conditional expectation. 
Some examples: 


• Let X and Y be the values of independent six-sided dice. What is
E[X | X + Y]? Here we observe that X + Y = E[X + Y | X + Y] =
E[X | X + Y] + E[Y | X + Y] = 2E[X | X + Y] by symmetry. So
E[X | X + Y] = (X + Y)/2. This is pretty much what we’d expect:
on average, half the total value is supplied by one of the dice. (It also
works well for extreme cases like X + Y = 12 or X + Y = 2, giving a
quick check on the formula.)


• What is E[(X + Y)² | X] when X and Y are independent? Compute
E[(X + Y)² | X] = E[X² | X] + 2E[XY | X] + E[Y² | X] = X² +
2X E[Y] + E[Y²]. For example, if X and Y are independent six-sided
dice we have E[(X + Y)² | X] = X² + 7X + 91/6, so if you are rolling
the dice one at a time and the first one comes up 5, you can expect on
average to get a squared total of 25 + 35 + 91/6 = 75 1/6. But if the first
one comes up 1, you only get 1 + 7 + 91/6 = 23 1/6 on average.
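Since E[X | Y] is a random variable that depends only on the value of Y, on a finite space we can represent it as a table from values of Y to numbers. The Python sketch below does this for two independent dice and reproduces both examples. The helper `cond_exp` and the table representation are my own illustration, not notation from these notes.

```python
from collections import defaultdict
from fractions import Fraction

faces = range(1, 7)
space = {(x, y): Fraction(1, 36) for x in faces for y in faces}

def cond_exp(f, g):
    """E[f | g] as a table mapping each value v of g to E[f | g = v]."""
    weight, total = defaultdict(Fraction), defaultdict(Fraction)
    for w, p in space.items():
        weight[g(w)] += p            # accumulates Pr[g = v]
        total[g(w)] += f(w) * p      # accumulates sum of f over outcomes with g = v
    return {v: total[v] / weight[v] for v in weight}

x_given_sum = cond_exp(lambda w: w[0], lambda w: w[0] + w[1])
sq_given_x = cond_exp(lambda w: (w[0] + w[1]) ** 2, lambda w: w[0])
print(x_given_sum[7], sq_given_x[5], sq_given_x[1])
```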


12.2.5 Markov’s inequality 


Knowing the expectation of a random variable gives you some information 
about it, but different random variables may have the same expectation but 
very different behavior: consider, for example, the random variable X that is 
0 with probability 1/2 and 1 with probability 1/2 and the random variable Y 
that is 1/2 with probability 1. In some cases we don’t care about the average 
value of a variable so much as its likelihood of reaching some extreme value: 
for example, if my feet are encased in cement blocks at the beach, knowing 
that the average high tide is only 1 meter is not as important as knowing 
whether it ever gets above 2 meters. Markov’s inequality lets us bound 
the probability of unusually high values of non-negative random variables as 
a function of their expectation. It says that, for any a > 0, 


\[ \Pr[X > a \mathrm{E}[X]] < 1/a. \]

This can be proved easily using conditional expectations. We have:

\[ \mathrm{E}[X] = \mathrm{E}[X \mid X > a\mathrm{E}[X]] \Pr[X > a\mathrm{E}[X]] + \mathrm{E}[X \mid X \le a\mathrm{E}[X]] \Pr[X \le a\mathrm{E}[X]]. \]

Since X is non-negative, E[X | X ≤ aE[X]] ≥ 0, so subtracting out the last
term on the right-hand side can only make it smaller. This gives:

\begin{align*}
\mathrm{E}[X] &\ge \mathrm{E}[X \mid X > a\mathrm{E}[X]] \Pr[X > a\mathrm{E}[X]] \\
&> a\mathrm{E}[X] \Pr[X > a\mathrm{E}[X]],
\end{align*}

and dividing both sides by aE[X] gives the desired result.

Another version of Markov’s inequality replaces > with ≥:

\[ \Pr[X \ge a \mathrm{E}[X]] \le 1/a. \]

The proof is essentially the same.




12.2.5.1 Example 


Suppose that all you know about the high tide height X is that
E[X] = 1 meter and X ≥ 0. What can we say about the probability
that X > 2 meters? Using Markov’s inequality, we get Pr[X > 2 meters] =
Pr[X > 2E[X]] < 1/2.
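To see how much slack the bound can have, here is a Python sketch that compares Pr[X ≥ 2E[X]] with the Markov bound 1/2 for two made-up tide distributions, both non-negative with E[X] = 1. The distributions and helper names are hypothetical, chosen only to show one case where the bound is very loose and one where the ≥ version holds with equality.

```python
from fractions import Fraction

def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

def tail(pmf, threshold):
    """Pr[X >= threshold]."""
    return sum(p for x, p in pmf.items() if x >= threshold)

# Hypothetical tide heights in meters; both have expectation 1.
always_one = {1: Fraction(1)}
zero_or_two = {0: Fraction(1, 2), 2: Fraction(1, 2)}

a = 2
for pmf in (always_one, zero_or_two):
    mean = expectation(pmf)
    print(mean, tail(pmf, a * mean), Fraction(1, a))
```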


12.2.5.2 Conditional Markov’s inequality 
There is, of course, a conditional version of Markov’s inequality:

\[ \Pr[X > a \mathrm{E}[X \mid A] \mid A] < 1/a. \]


This version doesn’t get anywhere near as much use as the unconditioned 
version, but it may be worth remembering that it exists. 


12.2.6 The variance of a random variable 


Expectation tells you the average value of a random variable, but it doesn’t 
tell you how far from the average the random variable typically gets: the 
random variables X = 0 and Y = ±1,000,000,000,000 with equal probability
both have expectation 0, though their distributions are very different. Though
it is impossible to summarize everything about the spread of a distribution in
a single number, a useful approximation for many purposes is the variance
Var[X] of a random variable X, which is defined as the expected square of
the deviation from the expectation, or E[(X − E[X])²].


Example Let X be 0 or 1 with equal probability. Then E[X] = 1/2, and 
(X — E[X])? is always 1/4. So Var [X] = 1/4. 


Example Let X be the value of a fair six-sided die. Then E[X] = 7/2, and

\[ \mathrm{E}[(X - \mathrm{E}[X])^2] = \frac{1}{6}\left((1 - 7/2)^2 + (2 - 7/2)^2 + (3 - 7/2)^2 + \dots + (6 - 7/2)^2\right) = \frac{35}{12}. \]


Computing variance directly from the definition can be tedious. Often it 
is easier to compute it from E[X?] and E[X]: 


\begin{align*}
\mathrm{Var}[X] &= \mathrm{E}[(X - \mathrm{E}[X])^2] \\
&= \mathrm{E}[X^2 - 2X \mathrm{E}[X] + (\mathrm{E}[X])^2] \\
&= \mathrm{E}[X^2] - 2\mathrm{E}[X]\mathrm{E}[X] + (\mathrm{E}[X])^2 \\
&= \mathrm{E}[X^2] - (\mathrm{E}[X])^2.
\end{align*}

The second-to-last step uses linearity of expectation and the fact that
E[X] is a constant.


Example For X being 0 or 1 with equal probability, we have E[X²] = 1/2
and (E[X])² = 1/4, so Var[X] = 1/4.


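Example Let’s try the six-sided die again, except this time we’ll use an
n-sided die. We have

\begin{align*}
\mathrm{Var}[X] &= \mathrm{E}[X^2] - (\mathrm{E}[X])^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} i^2 - \left(\frac{n+1}{2}\right)^2 \\
&= \frac{1}{n} \cdot \frac{n(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} \\
&= \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4}.
\end{align*}

When n = 6, this gives (7 · 13)/6 − 49/4 = 35/12. (OK, maybe it isn’t always easier.)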
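The closed form is easy to test against the definition. This Python sketch computes the variance of an n-sided die both ways using exact fractions; the function names `variance`, `die`, and `closed_form` are mine.

```python
from fractions import Fraction

def variance(pmf):
    """Var[X] = E[X^2] - (E[X])^2 for a finite mass function."""
    mean = sum(x * p for x, p in pmf.items())
    mean_square = sum(x * x * p for x, p in pmf.items())
    return mean_square - mean**2

def die(n):
    """Mass function of a fair n-sided die with faces 1..n."""
    return {x: Fraction(1, n) for x in range(1, n + 1)}

def closed_form(n):
    return Fraction((n + 1) * (2 * n + 1), 6) - Fraction((n + 1) ** 2, 4)

print(variance(die(6)), closed_form(6))
print(all(variance(die(n)) == closed_form(n) for n in range(1, 21)))
```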


12.2.6.1 Multiplication by constants 


Suppose we are asked to compute the variance of cX, where c is a constant. 
We have 


\begin{align*}
\mathrm{Var}[cX] &= \mathrm{E}[(cX)^2] - (\mathrm{E}[cX])^2 \\
&= c^2 \mathrm{E}[X^2] - (c\mathrm{E}[X])^2 \\
&= c^2 \mathrm{Var}[X].
\end{align*}

So, for example, if X is 0 or 2 with equal probability, Var[X] = 4 · (1/4) =
1. This is exactly what we expect given that X − E[X] is always ±1.


Another consequence is that Var [—X] = (—1)? Var [X] = Var [X]. So 
variance is not affected by negation. 
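12.2.6.2 The variance of a sum


What is Var[X + Y]? Write

\begin{align*}
\mathrm{Var}[X + Y] &= \mathrm{E}[(X + Y)^2] - (\mathrm{E}[X + Y])^2 \\
&= \mathrm{E}[X^2] + 2\mathrm{E}[XY] + \mathrm{E}[Y^2] - (\mathrm{E}[X])^2 - 2\mathrm{E}[X] \cdot \mathrm{E}[Y] - (\mathrm{E}[Y])^2 \\
&= (\mathrm{E}[X^2] - (\mathrm{E}[X])^2) + (\mathrm{E}[Y^2] - (\mathrm{E}[Y])^2) + 2(\mathrm{E}[XY] - \mathrm{E}[X] \cdot \mathrm{E}[Y]) \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2(\mathrm{E}[XY] - \mathrm{E}[X] \cdot \mathrm{E}[Y]).
\end{align*}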


The quantity E [XY] — E[X] E[Y] is called the covariance of X and Y 
and is written Cov [X,Y]. So we have just shown that 


Var [|X + Y] = Var [X] + Var [Y] + 2 Cov [X,Y]. 


When Cov [X,Y] = 0, or equivalently when E[XY] = E[X]E[Y], X 
and Y are said to be uncorrelated and their variances add. This occurs 
when X and Y are independent, but may also occur without X and Y being 
independent. 

For larger sums the corresponding formula is 


\[ \mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i] + \sum_{i \ne j} \mathrm{Cov}[X_i, X_j]. \]

This simplifies to Var[Σ X_i] = Σ Var[X_i] when the X_i are pairwise
independent, so that each pair of distinct X_i and X_j are independent.
Pairwise independence is implied by independence (but is not equivalent to
it), so this also works for fully independent random variables.

For example, we can use the simplified formula to compute the variance 
of the number of heads in n independent fair coin-flips. Let X; be the 
indicator variable for the event that the i-th flip comes up heads and let 
X be the sum of the X;. We have already seen that Var [X;] = 1/4, so 
Var [X] = n Var [X;] = n/4. 

Similarly, if c is a constant, then we can compute Var[X + c] = Var[X] +
Var[c] = Var[X], since (1) E[cX] = cE[X] = E[c] E[X] means that c
(considered as a random variable) and X are uncorrelated, and (2) Var[c] =
E[(c − E[c])²] = E[0] = 0. So shifting a random variable up or down doesn’t
change its variance.
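To watch the covariance term do its job, here is a Python sketch that checks Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y] on two small spaces: a pair of independent dice, and a die paired with Y = 7 − X (my choice of a strongly dependent example, where X + Y is constant). The helpers `E`, `var`, and `cov` are hypothetical names.

```python
from fractions import Fraction

def E(space, f):
    return sum(f(w) * p for w, p in space.items())

def var(space, f):
    return E(space, lambda w: f(w) ** 2) - E(space, f) ** 2

def cov(space, f, g):
    return E(space, lambda w: f(w) * g(w)) - E(space, f) * E(space, g)

faces = range(1, 7)
independent_dice = {(x, y): Fraction(1, 36) for x in faces for y in faces}
mirrored = {(x, 7 - x): Fraction(1, 6) for x in faces}    # Y = 7 - X

def X(w): return w[0]
def Y(w): return w[1]
def S(w): return w[0] + w[1]

for space in (independent_dice, mirrored):
    lhs = var(space, S)
    rhs = var(space, X) + var(space, Y) + 2 * cov(space, X, Y)
    print(lhs, rhs, cov(space, X, Y))
```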
12.2.6.3 Chebyshev’s inequality


Variance is an expectation, so we can use Markov’s inequality on it. The
result is Chebyshev’s inequality, which like Markov’s inequality comes in
two versions:

\begin{align*}
\Pr[|X - \mathrm{E}[X]| \ge r] &\le \frac{\mathrm{Var}[X]}{r^2}, \\
\Pr[|X - \mathrm{E}[X]| > r] &< \frac{\mathrm{Var}[X]}{r^2}.
\end{align*}

Proof. We’ll do the first version. The event |X − E[X]| ≥ r is the same as
the event (X − E[X])² ≥ r². By Markov’s inequality, the probability that
this occurs is at most E[(X − E[X])²]/r² = Var[X]/r².


Application: showing that a random variable is close to its expec- 
tation This is the usual statistical application. 


Example Flip a fair coin n times, and let X be the number of heads. What
is the probability that |X − n/2| > r? Recall that Var[X] = n/4, so
Pr[|X − n/2| > r] < (n/4)/r² = n/(4r²). So, for example, the chances
of deviating from the average by more than 1000 after 1000000 coin-flips
is less than 1/4.
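Chebyshev’s inequality is rarely tight, and for coin-flips we can see by how much. The Python sketch below computes the exact tail probability from binomial coefficients and compares it with the bound n/(4r²). I use n = 100 and r = 10 (my choice, giving the same bound of 1/4 as the example above) so that the exact sum stays small; the function names are mine.

```python
from fractions import Fraction
from math import comb

def exact_tail(n, r):
    """Pr[|X - n/2| > r] for X = number of heads in n fair coin-flips."""
    favorable = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) > 2 * r)
    return Fraction(favorable, 2**n)

def chebyshev_bound(n, r):
    """The bound Var[X]/r^2 = n/(4 r^2)."""
    return Fraction(n, 4 * r * r)

n, r = 100, 10
print(float(exact_tail(n, r)), float(chebyshev_bound(n, r)))
```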


Example Out of n voters in Saskaloosa County, m plan to vote for Smith for
County Dogcatcher. A polling firm samples k voters (with replacement)
and asks them who they plan to vote for. Suppose that m < n/2;
compute a bound on the probability that the polling firm incorrectly
polls a majority for Smith.


Solution: Let X_i be the indicator variable for a Smith vote when the
i-th voter is polled and let X = Σ X_i be the total number of pollees
who say they will vote for Smith. Let p = E[X_i] = m/n. Then
Var[X_i] = p − p², E[X] = kp, and Var[X] = k(p − p²). To get a
majority in the poll, we need X > k/2 or X − E[X] > k/2 − kp. Using
Chebyshev’s inequality, this event occurs with probability at most

\[ \frac{\mathrm{Var}[X]}{(k/2 - kp)^2} = \frac{k(p - p^2)}{(k/2 - kp)^2} = \frac{1}{k} \cdot \frac{p - p^2}{(1/2 - p)^2}. \]

Note that the bound decreases as k grows and (for fixed p) does not
depend on n.
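To get a feel for the sample sizes involved, here is a Python sketch that evaluates the bound for a hypothetical race in which Smith has 40% support. The function name `poll_error_bound` and the specific values of p and k are assumptions for illustration only.

```python
from fractions import Fraction

def poll_error_bound(p, k):
    """Chebyshev bound (1/k) (p - p^2) / (1/2 - p)^2, for p < 1/2 and k samples."""
    p = Fraction(p)
    return (p - p * p) / (k * (Fraction(1, 2) - p) ** 2)

for k in (100, 1000, 10000):
    print(k, float(poll_error_bound(Fraction(2, 5), k)))
```

The bound blows up as p approaches 1/2, which matches the intuition that close races need much bigger polls.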
In practice, statisticians will use a stronger result called the central
limit theorem, which describes the shape of the distribution of the sum of 
many independent random variables much more accurately than the bound 
from Chebyshev’s inequality. Designers of randomized algorithms are more 
likely to use Chernoff bounds. 


Application: lower bounds on random variables Unlike Markov’s
inequality, which can only show that a random variable can’t be too big too
often, Chebyshev’s inequality can be used to show that a random variable
can’t be too small, by showing first that its expectation is high and then
that its variance is low. For example, suppose that each of the 10^30 oxygen
molecules in the room is close enough to your mouth to inhale with pairwise
independent probability 10^−4 (it’s a big room). Then the expected number
of oxygen molecules near your mouth is a healthy 10^30 · 10^−4 = 10^26. What
is the probability that all 10^26 of them escape your grasp?

Let X_i be the indicator variable for the event that the i-th molecule is
close enough to inhale. We’ve effectively already used the fact that E[X_i] =
10^−4. To use Chebyshev’s inequality, we also need Var[X_i] = E[X_i²] −
E[X_i]² = 10^−4 − 10^−8 ≈ 10^−4. So the total variance is about 10^30 · 10^−4 =
10^26 and Chebyshev’s inequality says we have Pr[|X − E[X]| ≥ 10^26] ≤
10^26/(10^26)² = 10^−26. So death by failure of statistical mechanics is unlikely
(and the real probability is much much smaller).

But wait! Even a mere 90% drop in O₂ levels is going to be enough to
cause problems. What is the probability that this happens? Again we can
calculate Pr[90% drop] ≤ Pr[|X − E[X]| ≥ 0.9 · 10^26] ≤ 10^26/(0.9 · 10^26)² ≈
1.23 · 10^−26. So even temporary asphyxiation by statistical mechanics is not
something to worry about.


12.2.7 Probability generating functions 


For a discrete random variable X taking on only values in $\mathbb{N}$, we can express
its distribution using a probability generating function or pgf:

\[ F(z) = \sum_{n=0}^{\infty} \Pr[X = n] z^n. \]

These are essentially standard-issue generating functions (see §11.3) with
the additional requirement that all coefficients are non-negative and $F(1) = 1$.

A trivial example is the pgf for a Bernoulli random variable (1 with
probability p, 0 with probability $q = 1 - p$). Here the pgf is just $q + pz$.

A more complicated example is the pgf for a geometric random variable.
Now we have

\[ \sum_{n=0}^{\infty} q^n p z^n = p \sum_{n=0}^{\infty} (qz)^n = \frac{p}{1 - qz}. \]


12.2.7.1 Sums 


A very useful property of pgf’s is that the pgf of a sum of independent random
variables is just the product of the pgf’s of the individual random variables.
The reason for this is essentially the same as for ordinary generating functions:
when we multiply together two terms $(\Pr[X = n] z^n)(\Pr[Y = m] z^m)$, we
get $\Pr[X = n \wedge Y = m] z^{n+m}$, and the sum over all the different ways of
decomposing $n + m$ gives all the different ways to get this sum.

So, for example, the pgf of a binomial random variable equal to the sum
of n independent Bernoulli random variables is $(q + pz)^n$ (hence the name
“binomial”).
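
As a small sanity check of the product rule, the sketch below represents a pgf by its list of coefficients (entry n holds $\Pr[X = n]$) and multiplies lists the way polynomials multiply. The helper names `pgf_product` and `binomial_pgf` are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction
from math import comb


def pgf_product(f, g):
    """Multiply two pgfs given as coefficient lists, where f[n] = Pr[X = n]."""
    h = [Fraction(0)] * (len(f) + len(g) - 1)
    for n, a in enumerate(f):
        for m, b in enumerate(g):
            # Pr[X = n] * Pr[Y = m] contributes to the coefficient of z^(n+m).
            h[n + m] += a * b
    return h


def binomial_pgf(n, p):
    """Coefficients of (q + p z)^n, built as a product of n Bernoulli pgfs."""
    q = 1 - p
    result = [Fraction(1)]          # pgf of the constant 0
    for _ in range(n):
        result = pgf_product(result, [q, p])
    return result


p = Fraction(1, 3)
coeffs = binomial_pgf(4, p)
print(coeffs)
```

The resulting coefficients agree with the usual binomial probabilities $\binom{n}{k} p^k q^{n-k}$.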


12.2.7.2 Expectation and variance 


One nice thing about pgf’s is that they can be used to quickly compute
expectation and variance. For expectation, we have

\[ F'(z) = \sum_{n=0}^{\infty} n \Pr[X = n] z^{n-1}. \]

So

\[ F'(1) = \sum_{n=0}^{\infty} n \Pr[X = n] = \mathrm{E}[X]. \]

If we take the second derivative, we get

\[ F''(z) = \sum_{n=0}^{\infty} n(n-1) \Pr[X = n] z^{n-2} \]

or

\[ F''(1) = \sum_{n=0}^{\infty} n(n-1) \Pr[X = n] = \mathrm{E}[X(X-1)] = \mathrm{E}[X^2] - \mathrm{E}[X]. \]

So we can recover $\mathrm{E}[X^2]$ as $F''(1) + F'(1)$ and get $\mathrm{Var}[X]$ as
$F''(1) + F'(1) - (F'(1))^2$.

Example If X is a Bernoulli random variable with pgf $F = (q + pz)$,
then $F' = p$ and $F'' = 0$, giving $\mathrm{E}[X] = F'(1) = p$ and
$\mathrm{Var}[X] = F''(1) + F'(1) - (F'(1))^2 = 0 + p - p^2 = p(1 - p) = pq$.


Example If X is a binomial random variable with pgf $F = (q + pz)^n$,
then $F' = n(q + pz)^{n-1} p$ and $F'' = n(n-1)(q + pz)^{n-2} p^2$, giving
$\mathrm{E}[X] = F'(1) = np$ and
$\mathrm{Var}[X] = F''(1) + F'(1) - (F'(1))^2 = n(n-1)p^2 + np - n^2p^2 = np - np^2 = npq$.
These values would, of course,
be a lot faster to compute using the formulas for sums of independent
random variables, but it’s nice to see that they work.


Example If X is a geometric random variable with pgf $p/(1 - qz)$, then
$F' = pq/(1 - qz)^2$ and $F'' = 2pq^2/(1 - qz)^3$. So
$\mathrm{E}[X] = F'(1) = pq/(1 - q)^2 = pq/p^2 = q/p$, and
$\mathrm{Var}[X] = F''(1) + F'(1) - (F'(1))^2 = 2pq^2/(1 - q)^3 + q/p - q^2/p^2 = 2q^2/p^2 + q/p - q^2/p^2 = q^2/p^2 + q/p$,
which simplifies to $q/p^2$ because $p + q = 1$. The
variance would probably be a pain to calculate by hand.


Example Let X be a Poisson random variable with rate $\lambda$. We claimed
earlier that a Poisson random variable is the limit of a sequence of
binomial random variables where $p = \lambda/n$ and n goes to infinity, so
(cheating quite a bit) we expect that X’s pgf is
$F = \lim_{n \to \infty} ((1 - \lambda/n) + (\lambda/n)z)^n = \lim_{n \to \infty} (1 + (-\lambda + \lambda z)/n)^n = \exp(-\lambda + \lambda z) = \exp(-\lambda) \sum_{n=0}^{\infty} \lambda^n z^n / n!$.
We can check that the total probability $F(1) = \exp(-\lambda + \lambda) = e^0 = 1$,
that the expectation $F'(1) = \lambda \exp(-\lambda + \lambda) = \lambda$, and that the variance
$F''(1) + F'(1) - (F'(1))^2 = \lambda^2 \exp(-\lambda + \lambda) + \lambda - \lambda^2 = \lambda$. These last
two quantities are what we’d expect if we calculated the expectation
and the variance directly as the limit of taking n Bernoulli random
variables with expectation $\lambda/n$ and variance $(\lambda/n)(1 - \lambda/n)$ each.
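
The derivative formulas are easy to test when the pgf is a polynomial. The following sketch assumes a pgf stored as a coefficient list (entry n holding $\Pr[X = n]$) and uses a hypothetical helper `mean_and_variance`. It evaluates $F'(1)$ and $F''(1)$ directly from the coefficients and recovers the binomial mean and variance.

```python
from fractions import Fraction
from math import comb


def mean_and_variance(coeffs):
    """Return (E[X], Var[X]) from pgf coefficients using F'(1) and F''(1)."""
    d1 = sum(n * c for n, c in enumerate(coeffs))            # F'(1)
    d2 = sum(n * (n - 1) * c for n, c in enumerate(coeffs))  # F''(1)
    return d1, d2 + d1 - d1 * d1


# Binomial pgf (q + p z)^n written out coefficient by coefficient.
n, p = 10, Fraction(1, 4)
binomial = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
mean, variance = mean_and_variance(binomial)
print(mean, variance)   # should match np and npq
```

Exact fractions are used so that the comparison with np and npq does not depend on floating-point rounding.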


12.2.8 Summary: effects of operations on expectation and 
variance of random variables 


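\[
\begin{array}{ll}
\mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y] & \mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y] \\
\mathrm{E}[aX] = a\,\mathrm{E}[X] & \mathrm{Var}[aX] = a^2\,\mathrm{Var}[X] \\
\mathrm{E}[XY] = \mathrm{E}[X]\,\mathrm{E}[Y] + \mathrm{Cov}[X, Y] &
\end{array}
\]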


For the second line, a is a constant. None of these formulas assume 
independence, although we can drop Cov [X, Y] (because it is zero) whenever 
X and Y are independent. There is no simple formula for Var [XY]. 
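
The expectation and variance of $X - Y$ can be derived from the rules for
addition and multiplication by a constant:

\[
\begin{aligned}
\mathrm{E}[X - Y] &= \mathrm{E}[X + (-Y)] \\
&= \mathrm{E}[X] + \mathrm{E}[-Y] \\
&= \mathrm{E}[X] - \mathrm{E}[Y],
\end{aligned}
\]

and

\[
\begin{aligned}
\mathrm{Var}[X - Y] &= \mathrm{Var}[X + (-Y)] \\
&= \mathrm{Var}[X] + \mathrm{Var}[-Y] + 2\,\mathrm{Cov}[X, -Y] \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y] - 2\,\mathrm{Cov}[X, Y].
\end{aligned}
\]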


12.2.9 The general case 


So far we have only considered discrete random variables, which avoids a lot of
nasty technical issues. In general, a random variable on a probability space
$(\Omega, \mathcal{F}, P)$ is a function whose domain is $\Omega$ that satisfies some extra conditions
on its values that make interesting events involving the random variable
elements of $\mathcal{F}$. Typically the codomain will be the reals or the integers,
although any set is possible. Random variables are generally written as
capital letters with their arguments suppressed: rather than writing $X(\omega)$,
where $\omega \in \Omega$, we write just X.

A technical condition on random variables is that the inverse image of
any measurable subset of the codomain must be in $\mathcal{F}$. In simple terms, if
you can’t nail down $\omega$ exactly, being able to tell which element of $\mathcal{F}$ you land
in should be enough to determine the value of $X(\omega)$. For discrete random
variables, this just means that $X^{-1}(x) \in \mathcal{F}$ for each possible value x. For
real-valued random variables, the requirement is that the event $[X \le x]$ is in
$\mathcal{F}$ for any fixed x. In each case we say that X is measurable with respect
to $\mathcal{F}$ (or just “measurable $\mathcal{F}$”).* Usually we will not worry about this issue
too much, but it may come up if we are varying $\mathcal{F}$ to represent different
amounts of information available to different observers (e.g., if X and Y are
the values of two dice, X is measurable to somebody who can see both dice
but not to somebody who can only see the sum of the dice).

*The detail we are sweeping under the rug here is what makes a subset of the codomain
measurable. The essential idea is that we also have a $\sigma$-algebra $\mathcal{F}'$ on the codomain, and
elements of this codomain $\sigma$-algebra are the measurable subsets. The rules for simple
random variables and real-valued random variables come from default choices of $\sigma$-algebra.

The distribution function of a real-valued random variable describes
the probability that it takes on each of its possible values; it is specified
by giving a function $F(x) = \Pr[X \le x]$. The reason for using $\Pr[X \le x]$
instead of $\Pr[X = x]$ is that it allows specifying continuous random variables
such as a random variable that is uniform in the range [0,1]; this random
variable has a distribution function given by $F(x) = x$ when $0 \le x \le 1$,
$F(x) = 0$ for $x < 0$, and $F(x) = 1$ for $x > 1$.

For discrete random variables the distribution function will have
discontinuous jumps at each possible value of the variable. For example, the
distribution function of a variable X that is 0 or 1 with equal probability is
$F(x) = 0$ for $x < 0$, $1/2$ for $0 \le x < 1$, and 1 for $x \ge 1$.

Knowing the distribution of a random variable tells you what that variable
might do by itself, but doesn’t tell you how it interacts with other random
variables. For example, if X is 0 or 1 with equal probability then X and $1 - X$
both have the same distribution, but they are connected in a way that is
not true for X and some independent variable Y with the same distribution.
For multiple variables, a joint distribution gives the probability that
each variable takes on a particular value; for example, if X and Y are
two independent uniform samples from the range [0,1], their distribution
function is $F(x, y) = \Pr[X \le x \wedge Y \le y] = xy$ (when $0 \le x, y \le 1$). If instead
$Y = 1 - X$, the event $X \le x \wedge Y \le y$ is the same as $1 - y \le X \le x$, so we get
the distribution function $F(x, y) = \Pr[X \le x \wedge Y \le y]$
equal to $x + y - 1$ when $y \ge 1 - x$ and 0 when $y < 1 - x$ (assuming $0 \le x, y \le 1$).

We’ve seen that for discrete random variables, it is more useful to look at
the probability mass function $f(x) = \Pr[X = x]$. We can always recover
the probability distribution function from the probability mass function if
the latter sums to 1.


12.2.9.1 Densities 


If a real-valued random variable is continuous in the sense of having a 
distribution function with no jumps (which means that it has probability 0 of 
landing on any particular value), we may be able to describe its distribution 
by giving a density instead. The density is the derivative of the distribution 
function. We can also think of it as a probability at each point defined in the 
limit, by taking smaller and smaller regions around the point and dividing 
the probability of landing in the region by the size of the region. 

For example, the density of a uniform [0,1] random variable is $f(x) = 1$
for x in [0,1], and $f(x) = 0$ otherwise. For a uniform [0,2] random variable,
we get a density of $\frac{1}{2}$ throughout the [0,2] interval. The density always
integrates to 1.

Some distributions are easier to describe using densities than using
distribution functions. The normal distribution, which is of central importance
in statistics, has density

\[ \frac{e^{-x^2/2}}{\sqrt{2\pi}}. \]

Its distribution function is the integral of this quantity, which has no
closed-form expression.

Joint densities also exist. The joint density of a pair of random variables
with joint distribution function $F(x, y)$ is given by the partial derivative

\[ f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y). \]

The intuition here again is that we are approximating
the (zero) probability at a point by taking the probability of a small region
around the point and dividing by the area of the region.


12.2.9.2 Independence 


Independence is the same as for discrete random variables: Two random
variables X and Y are independent if any pair of events of the form $X \in A$,
$Y \in B$ are independent. For real-valued random variables it is enough to
show that their joint distribution $F(x, y)$ is equal to the product of their
individual distributions $F_X(x) F_Y(y)$. For real-valued random variables with
densities, showing the densities multiply also works. Both methods generalize
in the obvious way to sets of three or more random variables.


12.2.9.3 Expectation 


If a continuous random variable has a density $f(x)$, the formula for its
expectation is

\[ \mathrm{E}[X] = \int_{-\infty}^{\infty} x f(x)\,dx. \]

For example, let X be a uniform random variable in the range [a, b].
Then $f(x) = \frac{1}{b-a}$ when $a \le x \le b$ and 0 otherwise, giving

\[
\begin{aligned}
\mathrm{E}[X] &= \int_a^b \frac{x}{b-a}\,dx \\
&= \left. \frac{x^2}{2(b-a)} \right|_{x=a}^{b} \\
&= \frac{b^2 - a^2}{2(b-a)} \\
&= \frac{a+b}{2}.
\end{aligned}
\]

For continuous random variables without densities, we land in a rather
swampy end of integration theory. We will not talk about this case if we can 
help it. But in each case the expectation depends only on the distribution of 
X and not on its relationship to other random variables. 
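
The integral formula for expectation can be approximated numerically. The sketch below is a rough illustration under stated assumptions: the function name, the midpoint rule, and the sample interval [2, 5] are choices made here, not part of the text. It integrates $x f(x)$ for a uniform density and compares the result with $(a + b)/2$.

```python
def expectation_from_density(f, lo, hi, steps=1000):
    """Midpoint-rule approximation of the integral of x * f(x) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width   # midpoint of the i-th subinterval
        total += x * f(x) * width
    return total


a, b = 2.0, 5.0


def uniform_density(x):
    """Density of a uniform [a, b] random variable."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


estimate = expectation_from_density(uniform_density, a, b)
print(estimate, (a + b) / 2)
```

Because the integrand is linear on [a, b], the midpoint rule is exact here up to floating-point rounding; for other densities it only gives an approximation.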


Chapter 13 


Linear algebra 


Linear algebra is the branch of mathematics that studies vector spaces 
and linear transformations between them. 


13.1 Vectors and vector spaces 


Let’s start with vectors. In the simplest form, a vector consists of a sequence 
of n values from some field (see §4.1); for most purposes, this field will be R. 
The number of values (called coordinates) in a vector is the dimension 
of the vector. The set of all vectors over a given field of a given dimension 
(e.g., $\mathbb{R}^n$) forms a vector space, which has a more general definition that
we will give later. 

So the idea is that a vector represents a point in an n-dimensional space 
represented by its coordinates in some coordinate system. For example, if 
we imagine the Earth is flat, we can represent positions on the surface of the 
Earth as a latitude and longitude, with the point (0,0) representing the origin 
of the system at the intersection between the equator (all points of the form
(0, x)) and the prime meridian (all points of the form (x, 0)). In this system, the
location of Arthur K. Watson Hall (AKW) would be (41.31337, -72.92508),
and the location of LC 317 would be (41.30854, -72.92967). These are both
offsets (measured in degrees) from the origin point (0,0).


[Figure showing the vector x = (3, -1) and the sum x + y = (4, 1).]
Figure 13.1: Geometric interpretation of vector addition


13.1.1 Relative positions and vector addition 


What makes this a little confusing is that we will often use vectors to represent
relative positions as well.¹ So if we ask the question “where do I have to go
to get to LC 317 from AKW?”, one answer is to travel -0.00483 degrees in
latitude and -0.00459 degrees in longitude, or, in vector terms, to follow the
relative vector (-0.00483, -0.00459). This works because we define vector
addition coordinatewise: given two vectors x and y, their sum $x + y$ is defined
by $(x + y)_i = x_i + y_i$ for each index i. In geometric terms, this has the effect
of constructing a compound vector by laying vectors x and y end-to-end and
drawing a new vector from the start of x to the end of y (see Figure 13.1).

The correspondence between vectors as absolute positions and vectors 
as relative positions comes from fixing an origin 0. If we want to specify an 
absolute position (like the location of AKW), we give its position relative to 
the origin (the intersection of the equator and the prime meridian). Similarly, 
the location of LC 317 can be specified by giving its position relative to the 
origin, which we can compute by first going to AKW ((41.31337, -72.92508)),
and then adding the offset of LC 317 from AKW ((-0.00483, -0.00459)) to
this vector to get the offset directly from the origin ((41.30854, -72.92967)).

More generally, we can add together as many vectors as we want, by 
adding them coordinate-by-coordinate. 

This can be used to reduce the complexity of pirate-treasure instructions: 


1. Yargh! Start at the olde hollow tree on Dead Man’s Isle, if ye dare.

2. Walk 10 paces north.

3. Walk 5 paces east.

4. Walk 20 paces south.

5. Walk $6\sqrt{2}$ paces northwest.

6. Dig 8 paces down.

7. Climb back up 6 paces. There be the treasure, argh!


¹A further complication that we will sidestep completely is that physicists will often use
“vector” to mean both an absolute position and an offset from it (sort of like an edge in a
graph), requiring n coordinates to represent the starting point of the vector and another n
coordinates to represent the ending point. These vectors really do look like arrows at a
particular position in space. Our vectors will be simpler, and always start at the origin.


In vector notation, this becomes: 


1. (0, 0, 0)

2. + (10, 0, 0)

3. + (0, 5, 0)

4. + (-20, 0, 0)

5. + (6, -6, 0)

6. + (0, 0, -8)

7. + (0, 0, 6)


which sums to (-4, -1, -2). So we can make our life easier by walking 4
paces south, 1 pace west, and digging only 2 paces down.
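
The coordinatewise sum is simple to check mechanically. In the sketch below, the helper name `add_vectors` is an assumption made for illustration; each step is written as (paces north, paces east, paces up), which matches the signs used in the list above.

```python
def add_vectors(*vectors):
    """Add any number of equal-length vectors coordinate by coordinate."""
    return tuple(sum(coords) for coords in zip(*vectors))


# The seven pirate instructions as (north, east, up) offsets.
steps = [
    (0, 0, 0),      # start at the olde hollow tree
    (10, 0, 0),     # 10 paces north
    (0, 5, 0),      # 5 paces east
    (-20, 0, 0),    # 20 paces south
    (6, -6, 0),     # 6*sqrt(2) paces northwest = 6 north and 6 west
    (0, 0, -8),     # dig 8 paces down
    (0, 0, 6),      # climb back up 6 paces
]

print(add_vectors(*steps))   # (-4, -1, -2)
```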


13.1.2 Scaling 


Vectors may also be scaled by multiplying each of their coordinates by an
element of the base field, called a scalar. For example, if x = (-4, -1, -2)
is the number of paces north, east, and up from the olde hollow tree to
the treasure in the previous example, we can scale x by 2 to get the number
of paces for Short-Legged Pete. This gives

\[ 2x = 2(-4, -1, -2) = (-8, -2, -4). \]


13.2 Abstract vector spaces


So far we have looked at vectors in $\mathbb{R}^n$, which can be added together (by
adding their coordinates) and scaled (by multiplying by an element of $\mathbb{R}$).
In the abstract, a vector space is any set that supports these operations,
consistent with certain axioms that make it behave like we expect from our
experience with $\mathbb{R}^n$.

Formally, a vector space consists of vectors, which form an additive
Abelian group,² and scalars, which can be used to scale vectors through
scalar multiplication. The scalars are assumed to be a field (see §4.1); the
reals $\mathbb{R}$ and complex numbers $\mathbb{C}$ are typical choices.

Vector addition and scalar multiplication are related by a distributive
law and some consistency requirements. When a and b are scalars, and x
and y are vectors, we have

\[
\begin{aligned}
a(x + y) &= ax + ay \\
(a + b)x &= ax + bx \\
0x &= 0 \\
1x &= x \\
a(bx) &= (ab)x
\end{aligned}
\]

Note that in $0x = 0$, the 0 on the left-hand side is a scalar while the 0 on
the right-hand side is a vector.

To avoid confusion between scalars and vectors, some writers will mark vectors using
boldface ($\mathbf{x}$) or with a superimposed arrow ($\vec{x}$). Both are annoying enough
to type that we will not use either convention.

It’s not hard to see that the R” vectors we defined earlier satisfy this 
definition. But there are other examples of vector spaces: 


• The complex numbers $\mathbb{C}$ form a two-dimensional vector space over the
real numbers $\mathbb{R}$. This is because we can represent any complex number
$a + bi$ as a two-dimensional vector (a, b), and ordinary complex-number
addition and multiplication by reals behaves just like vector addition
and scalar multiplication in $\mathbb{R}^2$: $(a + bi) + (c + di) = (a + c) + (b + d)i$,
$r(a + bi) = (ra) + (rb)i$.


• If F is a field and S is any set, then the set $F^S$ of functions $f : S \to F$ is
a vector space, where the scalars are the elements of F, $f + g$ is defined
by $(f + g)(x) = f(x) + g(x)$, and $af$ is defined by $(af)(x) = a \cdot f(x)$.
Our usual finite-dimensional real vector spaces are special cases of this,
where $S = \{1, \dots, n\}$ and $F = \mathbb{R}$.


• The set of all real-valued random variables on a probability space forms
a vector space, with $X + Y$ and $aX$ defined in the usual way. For
discrete probability spaces, this is another special case of $F^S$, since each
random variable is really an element of $\mathbb{R}^\Omega$. For general probability
spaces, there are some technical conditions with measurability that
mean that we don’t get all of $\mathbb{R}^\Omega$, but it’s still the case that $X + Y$ and
$aX$ are random variables whenever X and Y are, giving us a vector
space.


• In some application areas, it’s common to consider restricted classes
of functions with some nice properties; for example, we might look at
functions from $\mathbb{R}$ to $\mathbb{R}$ that are continuous or differentiable, or infinite
sequences $\mathbb{N} \to \mathbb{R}$ that converge to a finite sum. These restricted classes
of functions are all vector spaces as long as $f + g$ and $af$ are in the
class whenever f and g are.


²This means that there is an addition operation for vectors that is commutative
($x + y = y + x$), associative ($x + (y + z) = (x + y) + z$), and has an identity element 0
($0 + x = x + 0 = x$) and inverses $-x$ ($x + (-x) = 0$).


13.3 Matrices 


We’ve seen that a sequence $a_1, a_2, \dots, a_n$ is really just a function from some
index set ($\{1 \dots n\}$ in this case) to some codomain, where $a_i = a(i)$ for each i.
What if we have two index sets? Then we have a two-dimensional structure:

\[ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{bmatrix} \]

where $A_{ij} = a(i, j)$, and the domain of the function is just the cross-product
of the two index sets. Such a structure is called a matrix. The
values $A_{ij}$ are called the elements or entries of the matrix. A sequence of
elements with the same first index is called a row of the matrix; similarly, a
sequence of elements with the same second index is called a column. The
dimension of the matrix specifies the number of rows and the number of
columns: the matrix above has dimension (3, 2), or, less formally, it is a $3 \times 2$
matrix.³ A matrix is square if it has the same number of rows and columns.

Note: The convention in matrix indices is to count from 1 rather than 0.
In programming language terms, matrices are written in FORTRAN.


³The convention for both indices and dimension is that rows come before columns.

Variables representing matrices are usually written with capital letters.
This is to distinguish them from both scalars and vectors.


13.3.1 Interpretation 


We can use a matrix any time we want to depict a function of two arguments 
(over small finite sets if we want it to fit on one page). A typical example (that 
predates the formal notion of a matrix by centuries) is a table of distances 
between cities or towns, such as this example from 1807:⁴

[Image: “Index of Distances from Town to Town, in the County of Hertford.”
The names of the respective towns are on the top and side, and the square where
both meet gives the distance.]


Because distance matrices are symmetric (see below), usually only half 
of the matrix is actually printed. 

Another example would be a matrix of counts. Suppose we have a set
of destinations D and a set of origins O. For each pair $(i, j) \in D \times O$, let
$C_{ij}$ be the number of different ways to travel from j to i. For example, let
origin 1 be Bass Library, origin 2 be AKW, and let destinations 1, 2, and
3 be Bass, AKW, and SML. Then there is 1 way to travel between Bass
and AKW (walk), 1 way to travel from AKW to SML (walk), and 2 ways
to travel from Bass to SML (walk above-ground or below-ground). If we
assume that we are not allowed to stay put, there are 0 ways to go from Bass
to Bass or AKW to AKW, giving the matrix

\[ C = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}. \]


⁴The original image is taken from
http://www.hertfordshire-genealogy.co.uk/data/books/books-3/book-0370-cooke-1807.htm.
As an exact reproduction of a public
domain document, this image is not subject to copyright in the United States.

Wherever we have counts, we can also have probabilities. Suppose we
have a particle that moves between positions $1 \dots n$ by flipping a coin, and
moving up with probability $\frac{1}{2}$ and down with probability $\frac{1}{2}$ (staying put if it
would otherwise move past the endpoints). We can describe this process by
a transition matrix P whose entry $P_{ij}$ gives the probability of moving to
i starting from j. For example, for n = 4, the transition matrix is

\[ P = \begin{bmatrix} 1/2 & 1/2 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 & 1/2 \end{bmatrix}. \]

Finally, the most common use of matrices in linear algebra is to represent
the coefficients of a linear transformation, which we will describe later.


13.3.2 Operations on matrices


Some functions of matrices are useful enough that they have names. 


13.3.2.1 Transpose of a matrix


The transpose of a matrix A, written $A^\top$ or $A'$, is obtained by reversing
the indices of the original matrix; $(A^\top)_{ij} = A_{ji}$ for each i and j. This has
the effect of turning rows into columns and vice versa:

\[
A^\top = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{bmatrix}^\top
= \begin{bmatrix} A_{11} & A_{21} & A_{31} \\ A_{12} & A_{22} & A_{32} \end{bmatrix}.
\]

If a matrix is equal to its own transpose (i.e., if $A_{ij} = A_{ji}$ for all i and j),
it is said to be symmetric. The transpose of an $n \times m$ matrix is an $m \times n$
matrix, so only square matrices can be symmetric.


13.3.2.2 Sum of two matrices


If we have two matrices A and B with the same dimension, we can compute
their sum $A + B$ by the rule $(A + B)_{ij} = A_{ij} + B_{ij}$. Another way to say this
is that matrix sums are done term-by-term: there is no interaction between
entries with different indices.

For example, suppose we have the matrix of counts C above of ways
of getting between two destinations on the Yale campus. Suppose that
upperclassmen are allowed to also take the secret Science Hill Monorail from
the sub-basement of Bass Library to the sub-basement of AKW. We can get
the total number of ways an upperclassman can get from each origin to each
destination by adding to C a second matrix M giving the paths involving
monorail travel:

\[
C + M = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}
+ \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 0 \end{bmatrix}
= \begin{bmatrix} 0 & 1 \\ 2 & 0 \\ 2 & 1 \end{bmatrix}.
\]


13.3.2.3 Product of two matrices


Suppose we are not content to travel once, but have a plan once we reach our
destination in D to travel again to a final destination in some set F. Just as
we constructed the matrix C (or C + M, for monorail-using upperclassmen)
counting the number of ways to go from each point in O to each point in
D, we can construct a matrix Q counting the number of ways to go from
each point in D to each point in F. Can we combine these two matrices to
compute the number of ways to travel $O \to D \to F$?

The resulting matrix is known as the product QC. We can compute
each entry in QC by taking a sum of products of entries in Q and C. Observe
that the number of ways to get from k to i via some single intermediate
point j is just $Q_{ij} C_{jk}$. To get all possible routes, we have to sum over all
possible intermediate points, giving $(QC)_{ik} = \sum_j Q_{ij} C_{jk}$.

This gives the rule for multiplying matrices in general: to get $(AB)_{ik}$,
sum $A_{ij} B_{jk}$ over all intermediate values j. This works only when the number
of columns in A is the same as the number of rows in B (since j has to vary
over the same range in both matrices), i.e., when A is an $n \times m$ matrix and
B is an $m \times s$ matrix for some n, m, and s. If the dimensions of the matrices
don’t match up like this, the matrix product is undefined. If the dimensions
do match, they are said to be compatible.

For example, let B = (C + M) from the sum example and let A be the
number of ways of getting from each of destinations 1 = Bass, 2 = AKW,
and 3 = SML to final destinations 1 = Heaven and 2 = Hell. After consulting
with appropriate representatives of the Divinity School, we determine that
one can get to either Heaven or Hell from any intermediate destination in
one way by dying (in a state of grace or sin, respectively), but that Bass
Library provides the additional option of getting to Hell by digging. This
gives a matrix

\[ A = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 1 & 1 \end{bmatrix}. \]

We can now compute the product

\[
A(C + M) = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 0 & 1 \\ 2 & 0 \\ 2 & 1 \end{bmatrix}
= \begin{bmatrix} 1 \cdot 0 + 1 \cdot 2 + 1 \cdot 2 & 1 \cdot 1 + 1 \cdot 0 + 1 \cdot 1 \\ 2 \cdot 0 + 1 \cdot 2 + 1 \cdot 2 & 2 \cdot 1 + 1 \cdot 0 + 1 \cdot 1 \end{bmatrix}
= \begin{bmatrix} 4 & 2 \\ 4 & 3 \end{bmatrix}.
\]

One special matrix I (for each dimension $n \times n$) has the property that
$IA = A$ and $BI = B$ for all matrices A and B with compatible dimension.
This matrix is known as the identity matrix, and is defined by the rule
$I_{ii} = 1$ and $I_{ij} = 0$ for $i \ne j$. It is not hard to see that in this case
$(IA)_{ij} = \sum_k I_{ik} A_{kj} = I_{ii} A_{ij} = A_{ij}$, giving $IA = A$; a similar computation
shows that $BI = B$. With a little more effort (omitted here) we can show
that I is the unique matrix with this identity property.


13.3.2.4 The inverse of a matrix 


A matrix A is invertible if there exists a matrix $A^{-1}$ such that
$AA^{-1} = A^{-1}A = I$. This is only possible if A is square (because otherwise the
dimensions don’t work) and may not be possible even then. Note that it is
enough to find a matrix such that $A^{-1}A = I$ to show that A is invertible.

To try to invert a matrix, we start with the pair of matrices A, I (where
I is the identity matrix defined above) and multiply both sides of the
pair from the left by a sequence of transformation matrices $B_1, B_2, \dots, B_k$
until $B_k B_{k-1} \cdots B_1 A = I$. At this point the right-hand matrix will be
$B_k B_{k-1} \cdots B_1 = A^{-1}$. (We could just keep track of all the $B_i$, but it’s easier
to keep track of their product.)

How do we pick the B;? These will be matrices that (a) multiply some 
row by a scalar, (b) add a multiple of one row to another row, or (c) swap 
two rows. We'll use the first kind to make all the diagonal entries equal one, 
and the second kind to get zeroes in all the off-diagonal entries. The third 
kind will be saved for emergencies, like getting a zero on the diagonal. 

That the operations (a), (b), and (c) correspond to multiplying by a
matrix is provable but tedious.⁵ Given these operations, we can turn any
invertible matrix A into I by working from the top down, rescaling each
row i using a type (a) operation to make $A_{ii} = 1$, then using a type (b)
operation to subtract $A_{ji}$ times row i from each row $j > i$ to zero out $A_{ji}$,
then finally repeating the same process starting at the bottom to zero out all
the entries above the diagonal. The only way this can fail is if we hit some
$A_{ii} = 0$, which we can swap with a nonzero $A_{ji}$ if one exists (using a type
(c) operation). If all the rows from i on down have a zero in column i,
then the original matrix A is not invertible. This entire process is known as
Gauss-Jordan elimination.


⁵The tedious details: To multiply row r by a, use a matrix B with $B_{ii} = 1$ when $i \ne r$,
$B_{rr} = a$, and $B_{ij} = 0$ for $i \ne j$; to add a times row r to row s, use a matrix B with $B_{ii} = 1$
for all i, $B_{sr} = a$, and $B_{ij} = 0$ for all other pairs ij; to swap rows r and s, use a matrix
B with $B_{ii} = 1$ for $i \notin \{r, s\}$, $B_{rs} = B_{sr} = 1$, and $B_{ij} = 0$ for all other pairs ij.

This procedure can be used to solve matrix equations: if $AX = B$, and
we know A and B, we can compute X by first computing $A^{-1}$ and then
multiplying $X = A^{-1}AX = A^{-1}B$. If we are not interested in $A^{-1}$ for
its own sake, we can simplify things by substituting B for I during the
Gauss-Jordan elimination procedure; at the end, it will be transformed to X.


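Example Original A is on the left, I on the right.
Initial matrices:

\[
\left[\begin{array}{ccc|ccc}
2 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
3 & 1 & 2 & 0 & 0 & 1
\end{array}\right]
\]

Divide top row by 2:

\[
\left[\begin{array}{ccc|ccc}
1 & 0 & 1/2 & 1/2 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
3 & 1 & 2 & 0 & 0 & 1
\end{array}\right]
\]

Subtract top row from middle row and 3·top row from bottom row:

\[
\left[\begin{array}{ccc|ccc}
1 & 0 & 1/2 & 1/2 & 0 & 0 \\
0 & 0 & 1/2 & -1/2 & 1 & 0 \\
0 & 1 & 1/2 & -3/2 & 0 & 1
\end{array}\right]
\]

Swap middle and bottom rows:

\[
\left[\begin{array}{ccc|ccc}
1 & 0 & 1/2 & 1/2 & 0 & 0 \\
0 & 1 & 1/2 & -3/2 & 0 & 1 \\
0 & 0 & 1/2 & -1/2 & 1 & 0
\end{array}\right]
\]

Multiply bottom row by 2:

\[
\left[\begin{array}{ccc|ccc}
1 & 0 & 1/2 & 1/2 & 0 & 0 \\
0 & 1 & 1/2 & -3/2 & 0 & 1 \\
0 & 0 & 1 & -1 & 2 & 0
\end{array}\right]
\]

Subtract (1/2)·bottom row from top and middle rows:

\[
\left[\begin{array}{ccc|ccc}
1 & 0 & 0 & 1 & -1 & 0 \\
0 & 1 & 0 & -1 & -1 & 1 \\
0 & 0 & 1 & -1 & 2 & 0
\end{array}\right]
\]

and we’re done. (It’s probably worth multiplying the original A by the
alleged $A^{-1}$ to make sure that we didn’t make a mistake.)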
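
The elimination procedure can be sketched in a few lines of code. The version below is a minimal illustration rather than a faithful copy of the steps above: the function name `invert` is an assumption, the sample 3 x 3 matrix is chosen for illustration, and the code clears each column above and below the pivot in a single pass instead of making separate downward and upward passes. Exact fractions avoid rounding problems.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination on the pair [A | I]."""
    n = len(matrix)
    rows = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
            for i, row in enumerate(matrix)]
    for i in range(n):
        # Type (c): swap in a row with a nonzero entry in column i, if any.
        pivot = next((r for r in range(i, n) if rows[r][i] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible")
        rows[i], rows[pivot] = rows[pivot], rows[i]
        # Type (a): rescale row i so the diagonal entry becomes 1.
        scale = rows[i][i]
        rows[i] = [x / scale for x in rows[i]]
        # Type (b): subtract multiples of row i to zero out the rest of column i.
        for r in range(n):
            if r != i and rows[r][i] != 0:
                factor = rows[r][i]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[i])]
    return [row[n:] for row in rows]


A = [[2, 0, 1], [1, 0, 1], [3, 1, 2]]
for row in invert(A):
    print([str(x) for x in row])
```

Multiplying A by the result, as the text suggests, is a cheap way to confirm that no step went wrong.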


13.3.2.5 Scalar multiplication 


Suppose we have a matrix A and some constant c. The scalar product cA
is given by the rule $(cA)_{ij} = cA_{ij}$; in other words, we multiply (or scale) each
entry in A by c. The quantity c in this context is called a scalar; the term
scalar is also used to refer to any other single number that might happen to
be floating around.

Note that if we only have scalars, we can pretend that they are $1 \times 1$
matrices; $a + b = a_{11} + b_{11}$ and $ab = a_{11} b_{11}$. But this doesn’t work if we
multiply a scalar by a matrix, since cA (where c is considered to be a matrix)
is only defined if A has only one row. Hence the distinction between matrices
and scalars.


13.3.3 Matrix identities 


For the most part, matrix operations behave like scalar operations, with a 
few important exceptions: 


1. Matrix multiplication is only defined for matrices with compatible 
dimensions. 


2. Matrix multiplication is not commutative: in general, we do not expect
that $AB = BA$. This is obvious when one or both of $A$ and $B$ is not
square (one of the products is undefined because the dimensions aren’t
compatible), but may also be true even if $A$ and $B$ are both square.


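For a simple example of a non-commutative pair of matrices, consider


\[
\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}
= \begin{bmatrix} 2 & 0 \\ 1 & 1 \end{bmatrix}
\ne
\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 1 & 2 \end{bmatrix}.
\]
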
On the other hand, matrix multiplication is associative: $A(BC) = (AB)C$.
The proof is by expansion of the definition. First compute

\begin{align*}
(A(BC))_{ij} &= \sum_k A_{ik} (BC)_{kj} \\
&= \sum_k \sum_m A_{ik} B_{km} C_{mj}.
\end{align*}

Then compute

\begin{align*}
((AB)C)_{ij} &= \sum_m (AB)_{im} C_{mj} \\
&= \sum_m \sum_k A_{ik} B_{km} C_{mj} \\
&= \sum_k \sum_m A_{ik} B_{km} C_{mj} \\
&= (A(BC))_{ij}.
\end{align*}


So despite the limitations due to non-compatibility and non-commutativity, 
we still have: 


Associative laws $A + (B + C) = (A + B) + C$ (easy), $(AB)C = A(BC)$
(see above). Also works for scalars: $c(AB) = (cA)B = A(cB)$ and
$(cd)A = c(dA) = d(cA)$.


Distributive laws $A(B + C) = AB + AC$, $(A + B)C = AC + BC$. Also
works for scalars: $c(A + B) = cA + cB$, $(c + d)A = cA + dA$.


Additive identity $A + 0 = 0 + A = A$, where $0$ is the all-zero matrix of
the same dimension as $A$.


Multiplicative identity $AI = A$, $IA = A$, $1A = A$, $A1 = A$, where $I$ is
the identity matrix of appropriate dimension in each case and $1$ is the
scalar value 1.


Inverse of a product $(AB)^{-1} = B^{-1}A^{-1}$. Proof: $(B^{-1}A^{-1})(AB) =
B^{-1}(A^{-1}A)B = B^{-1}(IB) = B^{-1}B = I$, and similarly for $(AB)(B^{-1}A^{-1})$.


Transposes $(A + B)^\top = A^\top + B^\top$ (easy), $(AB)^\top = B^\top A^\top$ (a little
trickier). $(A^{-1})^\top = (A^\top)^{-1}$, provided $A^{-1}$ exists (Proof: $A^\top (A^{-1})^\top =
(A^{-1}A)^\top = I^\top = I$).
Using these identities, we can do arithmetic on matrices without knowing
what their actual entries are, so long as we are careful about
non-commutativity. So for example we can compute

\[ (A + B)^2 = (A + B)(A + B) = A^2 + AB + BA + B^2. \]

Similarly, if for square $A$ we have

\[ S = \sum_{n=0}^{\infty} A^n \]

(where $A^0 = I$), we can solve the equation

\[ S = I + AS \]

by first subtracting $AS$ from both sides to get

\[ IS - AS = I \]

then applying the distributive law:

\[ (I - A)S = I \]

and finally multiplying both sides from the left by $(I - A)^{-1}$ to get

\[ S = (I - A)^{-1}, \]

assuming $I - A$ is invertible.
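
As a sanity check on this derivation, here is a small numerical sketch. The 2×2 matrix `A` is an arbitrary choice made for illustration (its powers shrink quickly, so the series converges), and the helper names are hypothetical. The loop builds partial sums of the series by iterating the equation $S = I + AS$.

```python
from fractions import Fraction

def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Illustrative matrix: A^2 = (1/8) I, so A^n goes to zero geometrically.
A = [[Fraction(0), Fraction(1, 2)], [Fraction(1, 4), Fraction(0)]]
I = identity(2)

# Start with S_0 = I and repeatedly apply S <- I + A S,
# which gives S_N = I + A + A^2 + ... + A^N after N iterations.
S = identity(2)
for _ in range(60):
    S = matadd(I, matmul(A, S))

# (I - A) S_N = I - A^(N+1), which should be extremely close to I.
I_minus_A = [[I[i][j] - A[i][j] for j in range(2)] for i in range(2)]
residual = matmul(I_minus_A, S)
```

For this particular `A`, the exact inverse of $I - A$ works out to $\frac{8}{7}\begin{bmatrix} 1 & 1/2 \\ 1/4 & 1 \end{bmatrix}$, and the partial sums approach it.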


13.4 Vectors as matrices 


Matrices give us an alternative representation for vectors, which allows us to
extend matrix multiplication to vectors as well. We’ll abuse terminology a
bit by referring to a $1 \times n$ or $n \times 1$ matrix as a vector.

A $1 \times n$ matrix is called a row vector for obvious reasons; similarly,
an $n \times 1$ matrix is called a column vector. It’s a good idea when using
a vector as a matrix to specify whether to interpret it as a row vector or
a column vector. You can also convert between the two forms using the
transpose operator $x^\top$.

Vectors defined in this way behave exactly like matrices in every respect. 
However, they are often written with lowercase letters to distinguish them 
from their taller and wider cousins. If this will cause confusion with scalars, 
we can disambiguate by writing vectors with a little arrow on top: $\vec{x}$ or in
boldface: $\mathbf{x}$. Often we will just hope it will be obvious from context which
variables represent vectors and which represent scalars, since writing all the
little arrows can take a lot of time.

When extracting individual coordinates from a vector, we omit the boring
index and just write $x_1$, $x_2$, etc. This is done for both row and column vectors,
so in either case we just write $x_i$ rather than a two-index entry such as $x_{1i}$ or $x_{i1}$.

Vector addition and scalar multiplication behave exactly the same for 
vectors-as-matrices as they did for our definition of vectors-as-sequences. 
This justifies treating vectors-as-matrices as just another representation for 
vectors. 


13.4.1 Length 


The length of a vector $x$, usually written as $\|x\|$ or sometimes just $|x|$, is
defined as $\sqrt{\sum_i x_i^2}$. The definition follows from the Pythagorean theorem:
$\|x\|^2 = \sum_i x_i^2$. Because the coordinates are squared, all vectors have
non-negative length, and only the zero vector has length 0.

Length interacts with scalar multiplication exactly as you would expect:
$\|cx\| = |c| \cdot \|x\|$. The length of the sum of two vectors depends on how they are
aligned with each other, but the triangle inequality $\|x + y\| \le \|x\| + \|y\|$
always holds.[6]

A special class of vectors are the unit vectors, those vectors $x$ for which
$\|x\| = 1$. In geometric terms, these correspond to all the points on the surface
of a radius-1 sphere centered at the origin. Any vector $x$ can be turned into
a unit vector $x/\|x\|$ by dividing by its length. In two dimensions, the unit
vectors are all of the form $\begin{bmatrix} \cos\theta & \sin\theta \end{bmatrix}^\top$, where by convention $\theta$ is the angle
from due east measured counterclockwise; this is why traveling 9 units
northwest corresponds to the vector $9 \begin{bmatrix} \cos 135^\circ & \sin 135^\circ \end{bmatrix}^\top = \begin{bmatrix} -9/\sqrt{2} & 9/\sqrt{2} \end{bmatrix}^\top$.

In one dimension, the unit vectors are $[\pm 1]$. There are no unit vectors in
zero dimensions: the unique zero-dimensional vector has length 0.
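
The definitions in this section translate directly into a few lines of code. This is a minimal sketch using floating-point arithmetic; the function names `length` and `normalize` are illustrative choices.

```python
import math

def length(x):
    """Euclidean length: square root of the sum of squared coordinates."""
    return math.sqrt(sum(v * v for v in x))

def normalize(x):
    """Return the unit vector x / ||x||; undefined for the zero vector."""
    n = length(x)
    if n == 0:
        raise ValueError("the zero vector cannot be normalized")
    return [v / n for v in x]

# Traveling 9 units northwest: angle 135 degrees counterclockwise from east.
theta = math.radians(135)
northwest = [9 * math.cos(theta), 9 * math.sin(theta)]

# Two sample vectors for checking the triangle inequality.
x, y = [3, 4], [-1, 2]
x_plus_y = [a + b for a, b in zip(x, y)]
```

Because the results are floating-point numbers, comparisons such as `length(normalize(x)) == 1` should be done with a tolerance (for example `math.isclose`) rather than exact equality.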


[6] These properties make $\|x\|$ an example of a norm. A norm is any real-valued function
$f$ on vectors that is positive for all nonzero vectors, zero for the zero vector, and satisfies
the scaling property $f(cx) = |c| \cdot f(x)$ and the triangle inequality $f(x + y) \le f(x) + f(y)$.
Other examples of norms are the $\ell_1$ norm $\|x\|_1 = \sum_i |x_i|$, the $\ell_\infty$ norm $\|x\|_\infty = \max_i |x_i|$,
and the general $\ell_p$ norm $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$.

The length of a vector is also known as the $\ell_2$ norm or Euclidean norm. Like the $\ell_1$
norm (and even the $\ell_\infty$ norm if you squint at it just right), the $\ell_2$ norm is a special case of
the $\ell_p$ norm.
13.4.2 Dot products and orthogonality


Suppose we have some column vector $x$, and we want to know how far $x$
sends us in a particular direction, where the direction is represented by a
unit column vector $e$. We can compute this distance (a scalar) by taking the
dot product

\[ e \cdot x = e^\top x = \sum_i e_i x_i. \]

For example, if $x = \begin{bmatrix} 3 & 4 \end{bmatrix}^\top$ and $e = \begin{bmatrix} 1 & 0 \end{bmatrix}^\top$, then the dot product is

\[ e \cdot x = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 3 \\ 4 \end{bmatrix} = 1 \cdot 3 + 0 \cdot 4 = 3. \]

In this case we see that the $\begin{bmatrix} 1 & 0 \end{bmatrix}^\top$ vector conveniently extracts the first
coordinate, which is about what we’d expect. But we can also find out how far
$x$ takes us in the $\begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix}^\top$ direction: this is $\begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix} x = 7/\sqrt{2}$.

By convention, we are allowed to take the dot product of two row vectors 
or of a row vector times a column vector or vice versa, provided of course that 
the non-boring dimensions match. In each case we transpose as appropriate 
to end up with a scalar when we take the matrix product. 

Nothing in the definition of the dot product restricts either vector to
be a unit vector. If we compute $x \cdot y$ where $x = ce$ and $\|e\| = 1$, then
we are effectively multiplying $e \cdot y$ by $c$. It follows that the dot product is
proportional to the length of both of its arguments. This often is expressed
in terms of the geometric formulation, memorized by vector calculus students
since time immemorial:

The dot product of $x$ and $y$ is equal to the product of their lengths times
the cosine of the angle between them.

This formulation is a little misleading, since modern geometers will often
define the angle between two vectors $x$ and $y$ as $\cos^{-1}\left(\frac{x \cdot y}{\|x\| \cdot \|y\|}\right)$, but it gives
a good picture of what is going on. One can also describe the dot product as
the length of $x$ times the signed length of the component of $y$ in the direction
of $x$, with the complication that this component counts as negative when it
points the opposite way from $x$.
The simple version in terms of coordinates is harder to get confused about,
so we’ll generally stick with that.

Two vectors are orthogonal if their dot product is zero. In geometric 
terms, this occurs when either one or both vectors is the zero vector or when 
the angle between them is +90° (since cos(+90°) = 0). In other words, two 
non-zero vectors are orthogonal if and only if they are perpendicular to each 
other. 
Orthogonal vectors satisfy the Pythagorean theorem: If $x \cdot y = 0$, then
$\|x + y\|^2 = (x + y) \cdot (x + y) = x \cdot x + x \cdot y + y \cdot x + y \cdot y = x \cdot x + y \cdot y = \|x\|^2 + \|y\|^2$.
It is not hard to see that the converse is also true: any pair of vectors for
which $\|x + y\|^2 = \|x\|^2 + \|y\|^2$ must be orthogonal (at least in $\mathbb{R}^n$).

Orthogonality is also an important property of vectors used to define 
coordinate systems, as we will see below. 
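
The coordinate formula for the dot product is short enough to write out directly. The sketch below recomputes the two examples from this section and checks the Pythagorean identity on one orthogonal pair; the vectors `u` and `v` are arbitrary illustrative choices, as are the function names.

```python
import math

def dot(x, y):
    """Dot product: sum of coordinate-wise products."""
    if len(x) != len(y):
        raise ValueError("dimensions must match")
    return sum(a * b for a, b in zip(x, y))

def orthogonal(x, y):
    """Two vectors are orthogonal when their dot product is zero."""
    return dot(x, y) == 0

x = [3, 4]
first_coordinate = dot([1, 0], x)              # how far x goes along the first axis
e = [1 / math.sqrt(2), 1 / math.sqrt(2)]       # unit vector along the diagonal
along_diagonal = dot(e, x)                     # how far x goes along the diagonal

# An orthogonal pair with integer coordinates, so equality tests are exact.
u, v = [2, 1], [-1, 2]
s = [a + b for a, b in zip(u, v)]
pythagoras_holds = dot(s, s) == dot(u, u) + dot(v, v)
```

The exact equality test in `orthogonal` is only reliable for integer or rational coordinates; with floating-point inputs a tolerance would be needed.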


13.5 Linear combinations and subspaces 


A linear combination of a set of vectors $x_1, \ldots, x_n$ is any vector that can be
expressed as $\sum c_i x_i$ for some coefficients $c_i$. The span of the vectors, written
$\langle x_1 \ldots x_n \rangle$, is the set of all linear combinations of the $x_i$.[7]

The span of a set of vectors forms a subspace of the vector space, where
a subspace is a set of vectors that is closed under linear combinations. This
is a succinct way of saying that if $x$ and $y$ are in the subspace, so is $ax + by$
for any scalars $a$ and $b$. We can prove this fact easily: if $x = \sum c_i x_i$ and
$y = \sum d_i x_i$, then $ax + by = \sum (ac_i + bd_i) x_i$.

A set of vectors $x_1, x_2, \ldots, x_n$ is linearly independent if there is no
way to write one of the vectors as a linear combination of the others, i.e.,
if there is no choice of coefficients that makes some $x_i = \sum_{j \ne i} c_j x_j$. An
equivalent definition is that there is no choice of coefficients $c_i$ such that
$\sum c_i x_i = 0$ and at least one $c_i$ is nonzero (to see the equivalence, subtract $x_i$
from both sides of the $x_i = \sum c_j x_j$ equation).
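
One way to test linear independence mechanically is row reduction: treat each vector as a row, eliminate, and count how many nonzero rows survive. The sketch below takes that approach with exact rational arithmetic; the function names `count_independent` and `independent` are hypothetical, and this is offered as an illustration rather than as the method the notes rely on.

```python
from fractions import Fraction

def count_independent(vectors):
    """Size of the largest independent subset, found by row reduction."""
    rows = [[Fraction(v) for v in vec] for vec in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivot rows found so far
    for col in range(ncols):
        # Look for a row at or below position r with a nonzero entry here.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate this column from every row below the pivot row.
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def independent(vectors):
    """True if no vector is a linear combination of the others."""
    return count_independent(vectors) == len(vectors)

pair = [[1, 0], [1, -2]]
triple = [[1, 2, 3], [4, 5, 6], [5, 7, 9]]   # third row = first + second
```

Row operations replace each vector by a linear combination of the originals without changing the span, which is why the count of surviving nonzero rows is meaningful.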


13.5.1 Bases 


If a set of vectors is both (a) linearly independent, and (b) spans the entire
vector space, then we call that set of vectors a basis of the vector space.
An example of a basis is the standard basis consisting of the vectors
$[1\,0\,\ldots\,0\,0]^\top, [0\,1\,\ldots\,0\,0]^\top, \ldots, [0\,0\,\ldots\,1\,0]^\top, [0\,0\,\ldots\,0\,1]^\top$. This has the additional
nice property of being made of vectors that are all orthogonal to each
other (making it an orthogonal basis) and of unit length (making it a
normal basis).

A basis that is both orthogonal and normal is called orthonormal.
We like orthonormal bases because we can recover the coefficients of some
arbitrary vector $v$ by taking dot-products. If $v = \sum_i a_i x_i$, then $v \cdot x_j =
\sum_i a_i (x_i \cdot x_j) = a_j$, since orthogonality means that $x_i \cdot x_j = 0$ when $i \ne j$, and
normality means $x_i \cdot x_i = \|x_i\|^2 = 1$.


[7] Technical note: If the set of vectors $\{x_i\}$ is infinite, then we will only permit linear
combinations with a finite number of nonzero coefficients. We will generally not consider
vector spaces big enough for this to be an issue.

However, even for non-orthonormal bases it is still the case that any 
vector can be written as a unique linear combination of basis elements. This 
fact is so useful we will state it as a theorem: 


Theorem 13.5.1. If $\{x_i\}$ is a basis for some vector space $V$, then every
vector $y$ has a unique representation $y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$.


Proof. Suppose there is some $y$ with more than one representation, i.e., there
are sequences of coefficients $a_i$ and $b_i$ such that $y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n =
b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$. Then $0 = y - y = (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) - (b_1 x_1 +
b_2 x_2 + \cdots + b_n x_n) = (a_1 - b_1) x_1 + (a_2 - b_2) x_2 + \cdots + (a_n - b_n) x_n$. But since
the $x_i$ are independent, the only way a linear combination of the $x_i$ can equal
$0$ is if all coefficients are 0, i.e., if $a_i = b_i$ for all $i$.


Even better, we can do all of our usual vector space arithmetic in terms
of the coefficients $a_i$. For example, if $a = \sum a_i x_i$ and $b = \sum b_i x_i$, then it can
easily be verified that $a + b = \sum (a_i + b_i) x_i$ and $ca = \sum (ca_i) x_i$.

However, it may be the case that the same vector will have different
representations in different bases. For example, in $\mathbb{R}^2$, we could have a basis
$B_1 = \{(1,0), (0,1)\}$ and a basis $B_2 = \{(1,0), (1,-2)\}$. Because $B_1$ is the
standard basis, the vector $(2,3)$ is represented as just $(2,3)$ using basis $B_1$,
but it is represented as $(7/2, -3/2)$ in basis $B_2$, since $\frac{7}{2}(1,0) - \frac{3}{2}(1,-2) = (2,3)$.
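
Finding the representation of a vector in a given basis amounts to solving a small linear system. The sketch below handles only the two-dimensional case, using Cramer's rule; the function name `coordinates_2d` is hypothetical.

```python
from fractions import Fraction

def coordinates_2d(basis, v):
    """Return (s, t) such that s*b1 + t*b2 == v for a basis (b1, b2) of R^2."""
    (a, c), (b, d) = basis          # b1 = (a, c), b2 = (b, d)
    det = a * d - b * c             # zero exactly when b1, b2 are dependent
    if det == 0:
        raise ValueError("the given vectors do not form a basis")
    s = Fraction(v[0] * d - b * v[1], det)
    t = Fraction(a * v[1] - c * v[0], det)
    return (s, t)

standard = [(1, 0), (0, 1)]
skewed = [(1, 0), (1, -2)]

in_standard = coordinates_2d(standard, (2, 3))
in_skewed = coordinates_2d(skewed, (2, 3))

# Rebuild the vector from its coordinates in the skewed basis.
s, t = in_skewed
rebuilt = (s * skewed[0][0] + t * skewed[1][0],
           s * skewed[0][1] + t * skewed[1][1])
```

For the vector $(2,3)$ and the basis $\{(1,0),(1,-2)\}$ this gives coordinates $(7/2, -3/2)$, and recombining the basis vectors with those weights returns $(2,3)$.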

Both bases above have the same size. This is not an accident; if a vector 
space has a finite basis, then all bases have the same size. We’ll state this as 
a theorem, too: 


Theorem 13.5.2. Let $x_1 \ldots x_n$ and $y_1 \ldots y_m$ be two finite bases of the same
vector space $V$. Then $n = m$.


Proof. Assume without loss of generality that $n \le m$. We will show how
to replace elements of the $x_i$ basis with elements of the $y_i$ basis to produce
a new basis consisting only of $y_1 \ldots y_n$. Start by considering the sequence
$y_1, x_1 \ldots x_n$. This sequence is not independent since $y_1$ can be expressed as
a linear combination of the $x_i$ (they’re a basis). So from Theorem 13.5.1 there
is some $x_i$ that can be expressed as a linear combination of $y_1, x_1 \ldots x_{i-1}$.
Swap this $x_i$ out to get a new sequence $y_1, x_1 \ldots x_{i-1}, x_{i+1}, \ldots x_n$. This new
sequence is also a basis, because (a) any $z$ can be expressed as a linear
combination of these vectors by substituting the expansion of $x_i$ into the
expansion of $z$ in the original basis, and (b) it’s independent, because if
there is some nonzero linear combination that produces $0$ we can substitute
the expansion of $x_i$ to get a nonzero linear combination of the original
basis that produces $0$ as well. Now continue by constructing the sequence
$y_2, y_1, x_1 \ldots x_{i-1}, x_{i+1}, \ldots x_n$, and arguing that some $x_{i'}$ in this sequence
must be expressible as a combination of earlier terms by Theorem 13.5.1 (it
can’t be $y_1$ because then $y_2, y_1$ is not independent), and drop this $x_{i'}$. By
repeating this process we can eventually eliminate all the $x_i$, leaving the
basis $y_n, \ldots, y_1$. But then any $y_k$ for $k > n$ would be a linear combination of
this basis, so we must have $m = n$.


The size of any basis of a vector space is called the dimension of the 
space. 


13.6 Linear transformations 


When we multiply a column vector by a matrix, we transform the vector into
a new vector. This transformation is linear in the sense that $A(x + y) =
Ax + Ay$ and $A(cx) = cAx$; thus we call it a linear transformation.
Conversely, any linear function $f$ from column vectors to column vectors
can be written as a matrix $M$ such that $f(x) = Mx$. We can prove this by
decomposing each $x$ using the standard basis.


Theorem 13.6.1. Let $f : \mathbb{R}^m \to \mathbb{R}^n$ be a linear transformation. Then there
is a unique $n \times m$ matrix $M$ such that $f(x) = Mx$ for all column vectors $x$.


Proof. We’ll use the following trick for extracting entries of a matrix by
multiplication. Let $M$ be an $n \times m$ matrix, and let $e^i$ be a column vector
with $e^i_j = 1$ if $i = j$ and $0$ otherwise.[8] Now observe that $(e^i)^\top M e^j =
\sum_k e^i_k (M e^j)_k = (M e^j)_i = \sum_k M_{ik} e^j_k = M_{ij}$. So given a particular linear $f$,
we will now define $M$ by the rule $M_{ij} = (e^i)^\top f(e^j)$. It is not hard to see
that this gives $f(e^j) = M e^j$ for each basis vector $e^j$, since multiplying by
$(e^i)^\top$ grabs the $i$-th coordinate in each case. To show that $Mx = f(x)$ for
all $x$, decompose each $x$ as $\sum_k c_k e^k$. Now compute $f(x) = f\left(\sum_k c_k e^k\right) =
\sum_k c_k f(e^k) = \sum_k c_k M(e^k) = M\left(\sum_k c_k e^k\right) = Mx$.
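
The construction in the proof can be carried out directly: apply $f$ to each standard basis vector and use the results as the columns of $M$. The sketch below does this for a hypothetical linear map `f` from three-dimensional to two-dimensional vectors; `matrix_of` and `apply` are illustrative names.

```python
def matrix_of(f, m):
    """Matrix (as a list of rows) of a linear map f on m-dimensional vectors.

    Column j of the result is f applied to the j-th standard basis vector.
    """
    columns = [f([int(i == j) for i in range(m)]) for j in range(m)]
    n = len(columns[0])
    return [[columns[j][i] for j in range(m)] for i in range(n)]

def apply(M, x):
    """Matrix-vector product Mx."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def f(x):
    # Hypothetical linear map from R^3 to R^2, chosen for illustration.
    return [x[0] + 2 * x[1], 3 * x[2] - x[0]]

M = matrix_of(f, 3)        # a 2 x 3 matrix
x = [4, 5, 6]
via_matrix = apply(M, x)
via_function = f(x)
```

Note that `matrix_of` trusts that `f` really is linear; for a non-linear function it would still return a matrix, but that matrix would not reproduce `f`.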

[8] We are abusing notation by not being specific about how long $e^i$ is; we will use the
same expression to refer to any column vector with a 1 in the $i$-th row and zeros everywhere
else. We are also moving what would normally be a subscript up into the superscript
position to leave room for the row index; this is a pretty common trick with vectors and
should not be confused with exponentiation.
13.6.1 Composition


What happens if we compose two linear transformations? We multiply the
corresponding matrices:

\[ (g \circ f)(x) = g(f(x)) = g(M_f x) = M_g (M_f x) = (M_g M_f) x. \]

This gives us another reason why the dimensions have to be compatible
to take a matrix product: If multiplying by an $n \times m$ matrix $A$ gives a map
$g : \mathbb{R}^m \to \mathbb{R}^n$, and multiplying by a $k \times l$ matrix $B$ gives a map $f : \mathbb{R}^l \to \mathbb{R}^k$,
then the composition $g \circ f$ corresponding to $AB$ only works if $m = k$.
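
A quick numerical illustration of this fact: applying two matrices in sequence gives the same result as applying their product once. The matrices `Mg` and `Mf` below are arbitrary examples with compatible dimensions, and the helper names are illustrative.

```python
def apply(M, x):
    """Matrix-vector product Mx."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def matmul(a, b):
    """Matrix product; requires len(a[0]) == len(b)."""
    if len(a[0]) != len(b):
        raise ValueError("incompatible dimensions")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

Mg = [[1, 2], [0, 1], [3, 0]]    # 3 x 2: maps R^2 to R^3
Mf = [[1, 0, 2], [0, 1, 1]]      # 2 x 3: maps R^3 to R^2
x = [1, 2, 3]

step_by_step = apply(Mg, apply(Mf, x))     # g(f(x))
all_at_once = apply(matmul(Mg, Mf), x)     # (Mg Mf) x
```

Trying `matmul(Mg, Mg)` raises an error, which mirrors the observation that $g \circ g$ makes no sense when the output dimension of $g$ differs from its input dimension.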


13.6.2 Role of rows and columns of M in the product Mx 


When we multiply a matrix and a column vector, we can think of the matrix 
as a sequence of row or column vectors and look at how the column vector 
operates on these sequences. 

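Let $M_{i\cdot}$ be the $i$-th row of the matrix (the “$\cdot$” is a stand-in for the missing
column index). Then we have

\[ (Mx)_i = \sum_k M_{ik} x_k = M_{i\cdot} \cdot x. \]

So we can think of $Mx$ as a vector of dot-products between the rows of
$M$ and $x$:

\[
\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}
= \begin{bmatrix} (1,2,3) \cdot (1,1,2) \\ (4,5,6) \cdot (1,1,2) \end{bmatrix}
= \begin{bmatrix} 9 \\ 21 \end{bmatrix}.
\]

Alternatively, we can work with the columns $M_{\cdot j}$ of $M$. Now we have

\[ (Mx)_i = \sum_k M_{ik} x_k = \sum_k (M_{\cdot k})_i x_k. \]

From this we can conclude that $Mx$ is a linear combination of columns
of $M$: $Mx = \sum_k x_k M_{\cdot k}$. Example:

\[
\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}
= 1 \begin{bmatrix} 1 \\ 4 \end{bmatrix} + 1 \begin{bmatrix} 2 \\ 5 \end{bmatrix} + 2 \begin{bmatrix} 3 \\ 6 \end{bmatrix}
= \begin{bmatrix} 1 \\ 4 \end{bmatrix} + \begin{bmatrix} 2 \\ 5 \end{bmatrix} + \begin{bmatrix} 6 \\ 12 \end{bmatrix}
= \begin{bmatrix} 9 \\ 21 \end{bmatrix}.
\]

The set $\{Mx\}$ for all $x$ is thus equal to the span of the columns of $M$; it
is called the column space of $M$.
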
For $yM$, where $y$ is a row vector, similar properties hold: we can think
of $yM$ either as a row vector of dot-products of $y$ with columns of $M$ or as a
weighted sum of rows of $M$; the proof follows immediately from the above
facts about a product of a matrix and a column vector and the fact that
$yM = (M^\top y^\top)^\top$. The span of the rows of $M$ is called the row space of $M$,
and equals the set $\{yM\}$ of all results of multiplying a row vector by $M$.
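
The two ways of reading the product $Mx$ can be compared side by side in code. This sketch uses the matrix with rows $(1,2,3)$ and $(4,5,6)$ and the vector $(1,1,2)$; the variable names are illustrative.

```python
M = [[1, 2, 3], [4, 5, 6]]
x = [1, 1, 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# View 1: each entry of Mx is the dot product of a row of M with x.
by_rows = [dot(row, x) for row in M]

# View 2: Mx is a weighted sum of the columns of M, with weights from x.
columns = [list(col) for col in zip(*M)]
by_columns = [0] * len(M)
for weight, col in zip(x, columns):
    by_columns = [acc + weight * c for acc, c in zip(by_columns, col)]

# Row-vector analogue: yM is a weighted sum of the rows of M.
y = [2, -1]
y_times_M = [0] * len(M[0])
for weight, row in zip(y, M):
    y_times_M = [acc + weight * r for acc, r in zip(y_times_M, row)]
```

Both views produce the same vector; which one is more useful depends on whether the question is about individual coordinates of the result (rows) or about which vectors the result can be (columns).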


13.6.3 Geometric interpretation 


Geometrically, linear transformations can be thought of as changing the 
basis vectors for a space: they keep the origin in the same place, move the 
basis vectors, and rearrange all the other vectors so that they have the same 
coordinates in terms of the new basis vectors. These new basis vectors are 
easily read off of the matrix representing the linear transformation, since they 
are just the columns of the matrix. So in this sense all linear transformations
are transformations from some vector space to the column space of some
matrix.[9]

This property makes linear transformations popular in graphics, where
they can be used to represent a wide variety of transformations of images.
Below is a picture of an untransformed image (top left) together with two
standard basis vectors labeled $x$ and $y$. In each of the other images, we have
shifted the basis vectors using a linear transformation, and carried the image
along with it.[10]


[9] The situation is slightly more complicated for infinite-dimensional vector spaces, but
we will try to avoid them.

[10] The thing in the picture is a Pokémon known as a Wooper, which evolves into a
Quagsire at level 20. This evolution is not a linear transformation.
Note that in all of these transformations, the origin stays in the same
place. If you want to move an image, you need to add a vector to everything.
This gives an affine transformation, which is any transformation that can
be written as $f(x) = Ax + b$ for some matrix $A$ and column vector $b$. One nifty
thing about affine transformations is that, like linear transformations, they
compose to produce new transformations of the same kind: $A(Cx + d) + b =
(AC)x + (Ad + b)$.
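
The composition rule for affine transformations can be checked with a short sketch. Here an affine map is represented as a pair (matrix, offset vector); this representation, the function names, and the sample values are all assumptions made for illustration.

```python
def apply(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def affine(A, b, x):
    """Evaluate the affine map x -> Ax + b."""
    return [p + q for p, q in zip(apply(A, x), b)]

def compose(A, b, C, d):
    """Return (A', b') with A'x + b' == A(Cx + d) + b for all x."""
    return matmul(A, C), [p + q for p, q in zip(apply(A, d), b)]

A, b = [[2, 0], [0, 2]], [1, 1]      # scale by 2, then shift by (1, 1)
C, d = [[0, -1], [1, 0]], [3, 0]     # rotate by 90 degrees, then shift by (3, 0)
x = [1, 2]

nested = affine(A, b, affine(C, d, x))
A2, b2 = compose(A, b, C, d)
combined = affine(A2, b2, x)
```

The combined map has matrix $AC$ and offset $Ad + b$, so a long chain of affine maps can be collapsed into a single matrix and a single offset before being applied to many points.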

Many two-dimensional linear transformations have standard names. The 
simplest transformation is scaling, where each axis is scaled by a constant, 
but the overall orientation of the image is preserved. In the picture above, 
the top right image is scaled by the same constant in both directions and 
the second-from-the-bottom image is scaled differently in each direction. 

Recall that the product $Mx$ corresponds to taking a weighted sum of
the columns of $M$, with the weights supplied by the coordinates of $x$. So in
terms of our basis vectors $x$ and $y$, we can think of a linear transformation as
specified by a matrix whose columns tell us what vectors to replace $x$ and $y$
with. In particular, a scaling transformation is represented by a matrix of
the form

\[ \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}, \]

where $s_x$ is the scale factor for the $x$ (first) coordinate and $s_y$ is the scale
factor for the $y$ (second) coordinate. Flips (as in the second image from the
top on the right) are a special case of scaling where one or both of the scale
factors is $-1$.

A more complicated transformation, as shown in the bottom image, is a
shear. Here the image is shifted by some constant amount in one coordinate
as the other coordinate increases. Its matrix looks like this:

\[ \begin{bmatrix} 1 & c \\ 0 & 1 \end{bmatrix}. \]

Here the $x$ vector is preserved: $(1,0)$ maps to the first column $(1,0)$, but
the $y$ vector is given a new component in the $x$ direction of $c$, corresponding
to the shear. If we also flipped or scaled the image at the same time that
we sheared it, we could represent this by putting values other than 1 on the
diagonal.

For a rotation, we will need some trigonometric functions to compute the
new coordinates of the axes as a function of the angle we rotate the image by.
The convention is that we rotate counterclockwise: so in the figure above,
the rotated image is rotated counterclockwise approximately $315^\circ$ or $-45^\circ$.
If $\theta$ is the angle of rotation, the rotation matrix is given by

\[ \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \]

For example, when $\theta = 0^\circ$, then we have $\cos\theta = 1$ and $\sin\theta = 0$, giving
the identity matrix. When $\theta = 90^\circ$, then $\cos\theta = 0$ and $\sin\theta = 1$, so
we rotate the $x$ axis to the vector $(\cos\theta, \sin\theta) = (0,1)$ and the $y$ axis to
$(-\sin\theta, \cos\theta) = (-1,0)$. This puts the $x$ axis pointing north where the $y$
axis used to be, and puts the $y$ axis pointing due west.
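
The three families of matrices described in this section are easy to build and test. The following sketch constructs scaling, shear, and rotation matrices and applies them to the basis vectors; the function names are illustrative.

```python
import math

def scaling(sx, sy):
    """Scale the first coordinate by sx and the second by sy."""
    return [[sx, 0], [0, sy]]

def shear(c):
    """Shift the first coordinate by c times the second coordinate."""
    return [[1, c], [0, 1]]

def rotation(degrees):
    """Counterclockwise rotation by the given angle."""
    t = math.radians(degrees)
    return [[math.cos(t), -math.sin(t)],
            [math.sin(t), math.cos(t)]]

def apply(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

x_axis, y_axis = [1, 0], [0, 1]

rotated_x = apply(rotation(90), x_axis)   # approximately (0, 1)
rotated_y = apply(rotation(90), y_axis)   # approximately (-1, 0)
sheared_y = apply(shear(2), y_axis)       # second column of the shear matrix
flipped = apply(scaling(-1, 1), [3, 4])   # flip across the vertical axis
```

The rotation results are only approximately equal to the exact values because `math.cos(math.radians(90))` is a tiny nonzero floating-point number rather than exactly zero.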


13.6.4 Rank and inverses


The dimension of the column space of a matrix—or, equivalently, the dimen- 
sion of the range of the corresponding linear transformation—is called the 
rank. The rank of a linear transformation determines, among other things, 
whether it has an inverse. 


Theorem 13.6.2. If $f : \mathbb{R}^n \to \mathbb{R}^m$ is a linear transformation with an inverse
$f^{-1}$, then we can show all of the following:


1. $f^{-1}$ is also a linear transformation.


2. $n = m$, and $f$ has full rank, i.e., $\mathrm{rank}(f) = \mathrm{rank}(f^{-1}) = m$.

Proof. 1. Let $x$ and $y$ be elements of $\mathrm{codomain}(f)$ and let $a$ be a scalar.
Then $f(a f^{-1}(x)) = a(f(f^{-1}(x))) = ax$, implying that $f^{-1}(ax) =
a f^{-1}(x)$. Similarly, $f(f^{-1}(x) + f^{-1}(y)) = f(f^{-1}(x)) + f(f^{-1}(y)) =
x + y$, giving $f^{-1}(x + y) = f^{-1}(x) + f^{-1}(y)$. So $f^{-1}$ is linear.


2. Pick any basis $e^i$ for $\mathbb{R}^n$, and observe that $\{f(e^i)\}$
spans $\mathrm{range}(f)$ (since we can always decompose $x$ as $\sum a_i e^i$ to get
$f(x) = \sum a_i f(e^i)$). So the dimension of $\mathrm{range}(f)$ is at most $n$. If
$n < m$, then $\mathrm{range}(f)$ is a proper subset of $\mathbb{R}^m$ (otherwise it would
be $m$-dimensional). This implies $f$ is not surjective and thus has no
inverse. Alternatively, if $m < n$, use the same argument to show that
any claimed $f^{-1}$ isn’t surjective, so it can’t be an inverse either. By the
same argument, if either $f$ or $f^{-1}$ does not have full rank, it’s not surjective.


The converse is also true: If $f : \mathbb{R}^n \to \mathbb{R}^n$ has full rank, it has an inverse.
The proof of this is to observe that if $\dim(\mathrm{range}(f)) = n$, then $\mathrm{range}(f) = \mathbb{R}^n$
(since $\mathbb{R}^n$ has no proper full-dimensional subspaces). So in particular we can take
any basis $\{e^i\}$ for $\mathbb{R}^n$ and find corresponding $\{x^i\}$ such that $f(x^i) = e^i$. Now
the linear transformation that maps $\sum a_i e^i$ to $\sum a_i x^i$ is an inverse for $f$,
since $f\left(\sum a_i x^i\right) = \sum a_i f(x^i) = \sum a_i e^i$.


13.6.5 Projections 


Suppose we are given a low-dimensional subspace of some high-dimensional
space (e.g. a line (dimension 1) passing through a plane (dimension 2)), and
we want to find the closest point in the subspace to a given point in the
full space. The process of doing this is called projection, and essentially
consists of finding some point $z$ such that $(x - z)$ is orthogonal to any vector
in the subspace.

Let’s look at the case of projecting onto a line first, then consider the
more general case.

[Figure: the point $x$ and the line through $0$ and $b$, with the points $y = cb$ and $db$ marked on it.]

A line consists of all points that are scalar multiples of some fixed vector
$b$. Given any other vector $x$, we want to extract all of the parts of $x$ that lie
in the direction of $b$ and throw everything else away. In particular, we want
to find a vector $y = cb$ for some scalar $c$, such that $(x - y) \cdot b = 0$. This is
enough information to solve for $c$.

We have $(x - cb) \cdot b = 0$, so $x \cdot b = c(b \cdot b)$ or $c = (x \cdot b)/(b \cdot b)$. So the
projection of $x$ onto the subspace $\{cb \mid c \in \mathbb{R}\}$ is given by $y = b(x \cdot b)/(b \cdot b)$
or $y = b(x \cdot b)/\|b\|^2$. If $b$ is normal (i.e. if $\|b\| = 1$), then we can leave out
the denominator; this is one reason we like orthonormal bases so much.

Why is this the right choice to minimize distance? Suppose we pick some
other vector $db$ instead. Then the points $x$, $cb$, and $db$ form a right triangle
with the right angle at $cb$, and the distance from $x$ to $db$ is $\|x - db\| =
\sqrt{\|x - cb\|^2 + \|cb - db\|^2} \ge \|x - cb\|$.

But now what happens if we want to project onto a larger subspace? For
example, suppose we have a point $x$ in three dimensions and we want to
project it onto some plane of the form $\{c_1 b_1 + c_2 b_2\}$, where $b_1$ and $b_2$ span
the plane. Here the natural thing to try is to send $x$ to $y = b_1 (x \cdot b_1)/\|b_1\|^2 +
b_2 (x \cdot b_2)/\|b_2\|^2$. We then want to argue that the vector $(x - y)$ is orthogonal
to any vector of the form $c_1 b_1 + c_2 b_2$. As before, if $(x - y)$ is orthogonal to any
vector in the plane, it’s orthogonal to the difference between the $y$ we picked
and some other $z$ we didn’t pick, so the right-triangle argument again shows
it gives the shortest distance.

Does this work? Let’s calculate:

\begin{align*}
(x - y) \cdot (c_1 b_1 + c_2 b_2)
&= x \cdot (c_1 b_1 + c_2 b_2)
 - \left( \frac{b_1 (x \cdot b_1)}{\|b_1\|^2} + \frac{b_2 (x \cdot b_2)}{\|b_2\|^2} \right) \cdot (c_1 b_1 + c_2 b_2) \\
&= c_1 \left( x \cdot b_1 - \frac{(b_1 \cdot b_1)(x \cdot b_1)}{b_1 \cdot b_1} \right)
 + c_2 \left( x \cdot b_2 - \frac{(b_2 \cdot b_2)(x \cdot b_2)}{b_2 \cdot b_2} \right) \\
&\quad - c_1 \frac{(b_1 \cdot b_2)(x \cdot b_2)}{b_2 \cdot b_2}
 - c_2 \frac{(b_1 \cdot b_2)(x \cdot b_1)}{b_1 \cdot b_1}.
\end{align*}

The first two terms cancel out very nicely, just as in the one-dimensional
case, but then we are left with a nasty $(b_1 \cdot b_2)$(much horrible junk) term at
the end. It didn’t work!

So what do we do? We could repeat our method for the one-dimensional
case and solve for $c_1$ and $c_2$ directly. This is probably a pain in the neck.
Or we can observe that the horrible extra term includes a $(b_1 \cdot b_2)$ factor,
and if $b_1$ and $b_2$ are orthogonal, it disappears. The moral: We can project
onto a 2-dimensional subspace by projecting independently onto the
1-dimensional subspace spanned by each basis vector, provided the basis vectors
are orthogonal. And now we have another reason to like orthonormal bases.

This generalizes to subspaces of arbitrary high dimension: as long as the
$b_i$ are all orthogonal to each other, the projection of $x$ onto the subspace $\langle b_i \rangle$
is given by $\sum_i b_i (x \cdot b_i)/\|b_i\|^2$. Note that we can always express this as matrix
multiplication by making each row of a matrix $B$ equal to one of the vectors
$b_i/\|b_i\|$; the product $Bx$ then gives the coefficients for the basis elements in
the projection of $x$, since we have already seen that multiplying a matrix by
a column vector corresponds to taking a dot product with each row. If we
want to recover the projected vector $\sum c_i b_i$ we can do so by taking advantage
of the fact that multiplying a matrix by a column vector also corresponds to
taking a linear combination of columns: this gives a combined operation of
$B^\top B x$ which we can express as a single projection matrix $P = B^\top B$. So
projection corresponds to yet another special case of a linear transformation.
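
The following sketch illustrates both the working case and the failure discussed above. It projects onto the span of a list of vectors by summing one-dimensional projections, which is only correct when the vectors are mutually orthogonal; the function name `project` and the sample vectors are illustrative choices.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(x, basis):
    """Sum of the projections of x onto each vector in basis.

    This equals the projection onto the span only if the basis vectors
    are mutually orthogonal (and nonzero).
    """
    y = [Fraction(0)] * len(x)
    for b in basis:
        c = Fraction(dot(x, b), dot(b, b))   # coefficient (x . b) / (b . b)
        y = [yi + c * bi for yi, bi in zip(y, b)]
    return y

x = [1, 2, 3]

# Orthogonal basis for the plane where the third coordinate is zero.
good = [[1, 1, 0], [1, -1, 0]]
y_good = project(x, good)
residual_good = [a - b for a, b in zip(x, y_good)]

# Non-orthogonal basis for the same plane: the formula no longer works.
bad = [[1, 0, 0], [1, 1, 0]]
y_bad = project(x, bad)
residual_bad = [a - b for a, b in zip(x, y_bad)]
```

With the orthogonal basis the residual $x - y$ is orthogonal to both basis vectors; with the non-orthogonal basis it is not, even though both bases span the same plane.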

One last detail: suppose we aren’t given orthonormal $b_i$ but are instead
given some arbitrary non-orthogonal non-normal basis for the subspace.
Then what do we do?

The trick here is to use a technique called Gram-Schmidt orthogonalization.
This constructs an orthogonal basis from an arbitrary basis by induction.
At each step, we have a collection of orthogonalized vectors $b_1 \ldots b_k$ and some
that we haven’t processed yet $a_{k+1} \ldots a_m$; the induction hypothesis says that
the $b_1 \ldots b_k$ vectors are (a) orthogonal and (b) span the same subspace as
$a_1 \ldots a_k$. The base case is the empty set of basis vectors, which is trivially
orthogonal and also trivially spans the subspace consisting only of the $0$ vector.
We add one new vector to the orthogonalized set by projecting $a_{k+1}$ to some
point $c$ on the subspace spanned by $b_1, \ldots, b_k$; we then let $b_{k+1} = a_{k+1} - c$.
This new vector is orthogonal to all of $b_1 \ldots b_k$ by the definition of orthogonal
projection, giving a new, larger orthogonal set $b_1 \ldots b_{k+1}$. These vectors
span the same subspace as $a_1 \ldots a_{k+1}$ because we can take any vector $x$
expressed as $\sum_{i=1}^{k+1} c_i a_i$, and rewrite it as $\sum_{i=1}^{k} c'_i b_i + c_{k+1}(c + b_{k+1})$
for suitable coefficients $c'_i$ (which exist by the induction hypothesis), and
in the second term $c_{k+1} c$ reduces to a linear combination of $b_1 \ldots b_k$; the
converse essentially repeats this argument in reverse. It follows that when
the process completes we have an orthogonal set of vectors $b_1 \ldots b_m$ that
span precisely the same subspace as $a_1 \ldots a_m$, and we have our orthogonal
basis. (But not orthonormal: if we want it to be orthonormal, we divide
each $b_i$ by $\|b_i\|$ as well.)
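
The inductive construction above can be written as a short loop. This sketch assumes the input vectors are linearly independent (otherwise some $b_i$ would be zero and the division would fail) and uses exact rational arithmetic; the function name `gram_schmidt` and the input vectors are illustrative.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Return an orthogonal basis spanning the same subspace as vectors.

    Assumes the input vectors are linearly independent.
    """
    basis = []
    for a in vectors:
        b = [Fraction(v) for v in a]
        # Subtract the projection of a onto the span of the earlier b_i.
        # Because the b_i are orthogonal, this is a sum of 1-D projections.
        for prev in basis:
            c = Fraction(dot(a, prev)) / dot(prev, prev)
            b = [bi - c * pi for bi, pi in zip(b, prev)]
        basis.append(b)
    return basis

a1, a2, a3 = [1, 1, 0], [1, 0, 1], [0, 1, 1]
b1, b2, b3 = gram_schmidt([a1, a2, a3])
```

The result here is orthogonal but not orthonormal; dividing each vector by its length would require square roots, which is why the sketch stops at the orthogonal basis and keeps the arithmetic exact.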


13.7 Further reading 


Linear algebra is a key tool in graphics, scientific computing, robotics, neural 
networks, and many other areas of Computer Science. If you do further work 
in these areas, you will quickly find that we have not covered anywhere near 
enough linear algebra in this course. Your best strategy for remedying this 
deficiency may be to take an actual linear algebra course; failing that, a very 
approachable introductory text is Linear Algebra and Its Applications, by 
Gilbert Strang [Str05]. You can also watch an entire course of linear algebra
lectures through YouTube: http://www.youtube.com/view_play_list?p=E7DDD91010BC51F8.

Some other useful books on linear algebra:


• Golub and Van Loan, Matrix Computations [GVL12]. Picks up where
Strang leaves off with practical issues in doing computation.


• Halmos, Finite-Dimensional Vector Spaces [Hal58]. Good introduction
to abstract linear algebra: properties of vector spaces without jumping
directly to matrices.


Matlab (which is available on the Zoo machines: type ‘matlab‘ at a shell 
prompt) is useful for playing around with operations on matrices. There 
are also various non-commercial knockoffs like Scilab or Octave that are 
not as comprehensive as Matlab but are adequate for most purposes. Note 
that with any of these tools, if you find yourselves doing lots of numerical 
computation, it is a good idea to talk to a numerical analyst about round-off 
error: the floating-point numbers inside computers are not the same as real 
numbers, and if you aren’t careful about how you use them you can get very 
strange answers. 


Chapter 14 


Finite fields 


Our goal here is to find computationally-useful structures that act enough
like the rational numbers $\mathbb{Q}$ or the real numbers $\mathbb{R}$ that we can do arithmetic
in them, but that are small enough that we can describe any element of the
structure uniquely with a finite number of bits. Such structures are called
finite fields.

An example of a finite field is $\mathbb{Z}_p$, the integers mod $p$ (see §8.3). These
finite fields are inconvenient for computers, which like to count in bits and
prefer numbers that look like $2^n$ to horrible nasty primes. So we’d really like
finite fields of size $2^n$ for various $n$, particularly if the operations of addition,
multiplication, etc. have a cheap implementation in terms of sequences of
bits. To get these, we will show how to construct a finite field of size $p^n$ for
any prime $p$ and positive integer $n$, and then let $p = 2$.


14.1 A magic trick 


We will start with a magic trick. Suppose we want to generate a long
sequence of bits that are hard to predict. One way to do this is using a
mechanism known as a linear-feedback shift register or LFSR. There
are many variants of LFSRs. Here is one that generates a sequence that
repeats every 15 bits by keeping track of 4 bits of state, which we think of
as a binary number $r = r_3 r_2 r_1 r_0$.

To generate each new bit, we execute the following algorithm:


1. Rotate the bits of $r$ left, to get a new number $r_2 r_1 r_0 r_3$.

2. If the former leftmost bit was 1, flip the new leftmost bit.

3. Output the rightmost bit.


Here is the algorithm in action, starting with $r = 0001$:

r      rotated r   rotated r after flip   output
0001   0010        0010                   0
0010   0100        0100                   0
0100   1000        1000                   0
1000   0001        1001                   1
1001   0011        1011                   1
1011   0111        1111                   1
1111   1111        0111                   1
0111   1110        1110                   0
1110   1101        0101                   1
0101   1010        1010                   0
1010   0101        1101                   1
1101   1011        0011                   1
0011   0110        0110                   0
0110   1100        1100                   0
1100   1001        0001                   1
0001   0010        0010                   0


After 15 steps, we get back to 0001, having passed through all possible
4-bit values except 0000. The output sequence 000111101011001... has the
property that every 4-bit sequence except 0000 appears starting at one of
the 15 positions, meaning that after seeing any 3 bits (except 000), 0 and 1
are equally likely to be the next bit in the sequence. We thus get a sequence
that is almost as long as possible given we have only $2^4$ possible states,
that is highly unpredictable, and that is cheap to generate. So unpredictable
and cheap, in fact, that the governments of both the United States^1 and
Russia^2 operate networks of orbital satellites that beam microwaves into our
brains carrying signals generated by linear-feedback shift registers very much
like this one. Similar devices are embedded at the heart of every modern
computer, scrambling all communications between the motherboard and PCI
cards to reduce the likelihood of accidental eavesdropping.
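
The rotate-and-flip rule is short enough to try out directly. The following is a minimal Python sketch of the 4-bit generator described above; the function names `lfsr_step` and `lfsr_bits` are made up for this illustration and are not part of the notes.

```python
def lfsr_step(r: int) -> int:
    """One step of the 4-bit generator: rotate left, then conditionally flip."""
    top = (r >> 3) & 1                    # former leftmost bit r3
    rotated = ((r << 1) & 0b1111) | top   # r2 r1 r0 r3
    if top:
        rotated ^= 0b1000                 # flip the new leftmost bit
    return rotated


def lfsr_bits(seed: int, count: int) -> list[int]:
    """Return `count` output bits (the rightmost bit after each step)."""
    bits, r = [], seed
    for _ in range(count):
        r = lfsr_step(r)
        bits.append(r & 1)
    return bits


# The 15 output bits produced when starting from r = 0001.
output = "".join(str(b) for b in lfsr_bits(0b0001, 15))
print(output)
```

Starting from 0001, the printed string should match the output column of the table, and running 15 more steps should reproduce the same bits.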

What horrifying deep voodoo makes this work? 


^1 In the form of the Global Positioning System.
^2 In the form of GLONASS.


14.2 Fields and rings


A field is a set $F$ together with two operations $+$ and $\cdot$ that behave
like addition and multiplication in the rationals or real numbers. Formally,
this means that:
1. Addition is associative: $(x + y) + z = x + (y + z)$ for all $x, y, z$
   in $F$.

2. There is an additive identity 0 such that $0 + x = x + 0 = x$ for all $x$
   in $F$.

3. Every $x$ in $F$ has an additive inverse $-x$ such that
   $x + (-x) = (-x) + x = 0$.

4. Addition is commutative: $x + y = y + x$ for all $x, y$ in $F$.

5. Multiplication distributes over addition:
   $x \cdot (y + z) = (x \cdot y + x \cdot z)$ and
   $(y + z) \cdot x = (y \cdot x + z \cdot x)$ for all $x, y, z$ in $F$.

6. Multiplication is associative: $(x \cdot y) \cdot z = x \cdot (y \cdot z)$
   for all $x, y, z$ in $F$.

7. There is a multiplicative identity 1 such that
   $1 \cdot x = x \cdot 1 = x$ for all $x$ in $F$.

8. Multiplication is commutative: $x \cdot y = y \cdot x$ for all $x, y$
   in $F$.

9. Every $x$ in $F \setminus \{0\}$ has a multiplicative inverse $x^{-1}$
   such that $x \cdot x^{-1} = x^{-1} \cdot x = 1$.


Some structures fail to satisfy all of these axioms but are still interesting
enough to be given names. A structure that satisfies 1-3 is called a group;
1-4 is an abelian group or commutative group; 1-7 is a ring; 1-8 is a
commutative ring. In the case of groups and abelian groups there is only
one operation $+$. There are also more exotic names for structures satisfying
other subsets of the axioms.^3

Some examples of fields: $\mathbb{R}$, $\mathbb{Q}$, $\mathbb{C}$, and
$\mathbb{Z}_p$ where $p$ is prime. We will be particularly interested in
$\mathbb{Z}_p$, since we are looking for finite fields that can fit inside a
computer.

The integers $\mathbb{Z}$ are an example of a commutative ring, as is
$\mathbb{Z}_m$ for $m > 1$. Square matrices of fixed dimension greater than 1
are an example of a non-commutative ring.


^3 A set with one operation that does not necessarily satisfy any axioms is a
magma. If the operation is associative, it's a semigroup, and if there is
also an identity (but not necessarily inverses), it's a monoid. For example,
the set of nonempty strings with + interpreted as concatenation forms a
semigroup, and throwing in the empty string as well gives a monoid.

Weaker versions of rings knock out the multiplicative identity (a pseudo-ring
or rng) or negation (a semiring or rig). An example of a semiring that is
actually useful is the (max, +) semiring, which uses max for addition and +
(which distributes over max) for multiplication; this turns out to be handy
for representing scheduling problems.


14.3 Polynomials over a field


Any field $F$ generates a polynomial ring $F[x]$ consisting of all
polynomials in the variable $x$ with coefficients in $F$. For example, if
$F = \mathbb{Q}$, some elements of $\mathbb{Q}[x]$ are $3/5$,
$(22/7)x^2 + 12$, $9003x^{417} - (32/3)x^4 + x^2$, etc. Addition and
multiplication are done exactly as you'd expect, by applying the distributive
law and combining like terms:
$(x + 1) \cdot (x^2 + 3/5) = x \cdot x^2 + x \cdot (3/5) + x^2 + (3/5)
= x^3 + x^2 + (3/5)x + (3/5)$.

The degree $\deg(p)$ of a polynomial $p$ in $F[x]$ is the exponent on the
leading term, the term with a nonzero coefficient that has the largest
exponent. Examples: $\deg(x^2 + 1) = 2$, $\deg(17) = 0$. For 0, which doesn't
have any terms with nonzero coefficients, the degree is taken to be $-\infty$.
Degrees add when multiplying polynomials:
$\deg((x^2 + 1)(x + 5)) = \deg(x^2 + 1) + \deg(x + 5) = 2 + 1 = 3$; this is
just a consequence of the leading terms in the polynomials we are multiplying
producing the leading term of the new polynomial. For addition, we have
$\deg(p + q) \le \max(\deg(p), \deg(q))$, but we can't guarantee equality
(maybe the leading terms cancel).

Because $F[x]$ is a ring, we can't do division the way we do it in a field
like $\mathbb{R}$, but we can do division the way we do it in a ring like
$\mathbb{Z}$, leaving a remainder. The equivalent of the integer division
algorithm for $\mathbb{Z}$ is:


Theorem 14.3.1 (Division algorithm for polynomials). Given a polynomial
$f$ and a nonzero polynomial $g$ in $F[x]$, there are unique polynomials $q$
and $r$ such that $f = q \cdot g + r$ and $\deg(r) < \deg(g)$.


Proof. The proof is by induction on $\deg(f)$. If $\deg(f) < \deg(g)$, let
$q = 0$ and $r = f$. Otherwise, let $m = \deg(f)$, $n = \deg(g)$, and
$q_{m-n} = f_m \cdot g_n^{-1}$, where $f_m$ and $g_n$ are the leading
coefficients of $f$ and $g$. Then $q_{m-n} x^{m-n} g$ is a degree-$m$
polynomial with leading term $f_m x^m$. Subtracting this from $f$ gives a
polynomial $f'$ of degree at most $m - 1$, and by the induction hypothesis
there exist $q'$, $r$ such that $f' = q' \cdot g + r$ and
$\deg r < \deg g$. Let $q = q_{m-n} x^{m-n} + q'$; then
$f = f' + q_{m-n} x^{m-n} g = (q_{m-n} x^{m-n} g + q' g + r) = q \cdot g + r$.


The essential idea of the proof is that we are finding $q$ and $r$ using the
same process of long division as we use for integers. For example, in
$\mathbb{Q}[x]$:

                  x + 1
    x + 2 ) x^2 + 3x + 5
           -x^2 - 2x
           -------------
                   x + 5
                  -x - 2
                  ------
                       3

From this we get $x^2 + 3x + 5 = (x + 2)(x + 1) + 3$, with
$\deg(3) = 0 < \deg(x + 2) = 1$.
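
The long-division process can be sketched in a few lines of Python. This is only an illustration under some assumptions that are not in the notes: a polynomial is stored as a list of coefficients with the constant term first, coefficients are exact rationals from the standard `fractions` module, and the helper name `poly_divmod` is invented here.

```python
from fractions import Fraction


def poly_divmod(f, g):
    """Return (q, r) with f = q*g + r and deg(r) < deg(g).

    Polynomials are coefficient lists, lowest degree first;
    the zero polynomial is the empty list.
    """
    g = [Fraction(c) for c in g]
    while g and g[-1] == 0:          # drop zero leading coefficients
        g.pop()
    if not g:
        raise ZeroDivisionError("g must be a nonzero polynomial")

    r = [Fraction(c) for c in f]
    while r and r[-1] == 0:
        r.pop()
    q = [Fraction(0)] * max(len(r) - len(g) + 1, 1)

    while len(r) >= len(g):
        shift = len(r) - len(g)      # this is m - n in the proof
        coeff = r[-1] / g[-1]        # f_m * g_n^{-1}
        q[shift] = coeff
        for i, c in enumerate(g):    # subtract coeff * x^shift * g
            r[shift + i] -= coeff * c
        while r and r[-1] == 0:
            r.pop()
    return q, r


# (x^2 + 3x + 5) divided by (x + 2)
q, r = poly_divmod([5, 3, 1], [2, 1])
print(q, r)
```

For the worked example, `q` holds the coefficients of $x + 1$ and `r` holds the constant 3.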

We are going to use division of polynomials to define finite fields by
taking $F[x]$ modulo some well-chosen polynomial, analogously to the way we
can turn $\mathbb{Z}$ (a ring) into a field $\mathbb{Z}_p$ by taking
quotients mod $p$. As long as we choose the right polynomial, this works
over any field.


14.4 Algebraic field extensions 


Given a field $F$, we can make a bigger field by adding in extra elements
that behave in a well-defined and consistent way. An example of this is the
extension of the real numbers $\mathbb{R}$ to the complex numbers
$\mathbb{C}$ by adding $i$.

The general name for this trick is algebraic field extension or just
field extension, and it works by first constructing the ring of polynomials
$F[x]$ and then smashing it down into a field by taking remainders modulo
some fixed polynomial $p(x)$. For this to work, the polynomial has to be
irreducible, which for a nonconstant polynomial means that $p$ can't be
written as a product of two polynomials in $F[x]$ that both have smaller
degree than $p$. For polynomials of degree 2 or 3, this is equivalent to
$p(x) = 0$ having no solution in $F$, that is, to $p$ not being factorable
as $(x + a)p'$ for some $a$ and $p'$. This definition makes irreducibility
sort of like being prime, and makes this construction sort of like the
construction of $\mathbb{Z}_p$.

The fact that the resulting object is a field follows from inheriting all the
commutative ring properties from $F[x]$, plus getting multiplicative inverses
for essentially the same reason as in $\mathbb{Z}_p$: we can find them using
the extended Euclidean algorithm applied to polynomials instead of integers
(we won't prove this).

In the case of the complex numbers $\mathbb{C}$, the construction is
$\mathbb{C} = \mathbb{R}[i]/(i^2 + 1)$. Because $i^2 + 1 = 0$ has no solution
$i \in \mathbb{R}$, this makes $i^2 + 1$ an irreducible polynomial. An element
of $\mathbb{C}$ is then a degree-1 or less polynomial in $\mathbb{R}[i]$,
because these are the only polynomials that survive taking the remainder
mod $i^2 + 1$ intact.

If you've used complex numbers before, you were probably taught to
multiply them using the rule $i^2 = -1$, which is a rewriting of
$i^2 + 1 = 0$. This is equivalent to taking remainders:
$(i + 1)(i + 2) = (i^2 + 3i + 2) = 1 \cdot (i^2 + 1) + (3i + 1) = 3i + 1$.

The same thing works for other fields and other irreducible polynomials.
For example, in $\mathbb{Z}_2$, the polynomial $x^2 + x + 1$ is irreducible,
because $x^2 + x + 1 = 0$ has no solution (try plugging in 0 and 1 to see).
So we can construct a new finite field $\mathbb{Z}_2[x]/(x^2 + x + 1)$ whose
elements are polynomials with coefficients in $\mathbb{Z}_2$ with all
operations done modulo $x^2 + x + 1$.

Addition in $\mathbb{Z}_2[x]/(x^2 + x + 1)$ looks like vector addition:^4
$(x + 1) + (x + 1) = 0 \cdot x + 0 = 0$, $(x + 1) + x = 1$,
$(1) + (x) = (x + 1)$. Multiplication in $\mathbb{Z}_2[x]/(x^2 + x + 1)$
works by first multiplying the polynomials and taking the remainder mod
$(x^2 + x + 1)$:
$(x + 1) \cdot (x + 1) = x^2 + 1 = 1 \cdot (x^2 + x + 1) + x = x$.
If you don't want to take remainders, you can instead substitute $x + 1$ for
any occurrence of $x^2$ (just like substituting $-1$ for $i^2$ in
$\mathbb{C}$), since $x^2 + x + 1 = 0$ implies $x^2 = -x - 1 = x + 1$ (since
$-1 = 1$ in $\mathbb{Z}_2$).

The full multiplication table for this field looks like this:

          |  0     1     x     x+1
    ------+------------------------
    0     |  0     0     0     0
    1     |  0     1     x     x+1
    x     |  0     x     x+1   1
    x+1   |  0     x+1   1     x

We can see that every nonzero element has an inverse by looking for ones
in the table; e.g. $1 \cdot 1 = 1$ means 1 is its own inverse and
$x \cdot (x + 1) = x^2 + x = 1$ means that $x$ and $x + 1$ are inverses of
each other.

Here's the same thing for $\mathbb{Z}_2[x]/(x^3 + x + 1)$:

            | 0   1         x         x+1       x^2       x^2+1     x^2+x     x^2+x+1
    --------+-------------------------------------------------------------------------
    0       | 0   0         0         0         0         0         0         0
    1       | 0   1         x         x+1       x^2       x^2+1     x^2+x     x^2+x+1
    x       | 0   x         x^2       x^2+x     x+1       1         x^2+x+1   x^2+1
    x+1     | 0   x+1       x^2+x     x^2+1     x^2+x+1   x^2       1         x
    x^2     | 0   x^2       x+1       x^2+x+1   x^2+x     x         x^2+1     1
    x^2+1   | 0   x^2+1     1         x^2       x         x^2+x+1   x+1       x^2+x
    x^2+x   | 0   x^2+x     x^2+x+1   1         x^2+1     x+1       x         x^2
    x^2+x+1 | 0   x^2+x+1   x^2+1     x         1         x^2+x     x^2       x+1

Note that we now have $2^3 = 8$ elements. In general, if we take
$\mathbb{Z}_p[x]$ modulo an irreducible degree-$n$ polynomial, we will get a
field with $p^n$ elements. These turn out to be all the possible finite
fields, with exactly one finite field for each number of the form $p^n$ (up
to isomorphism, which means that we consider two fields equivalent if there
is a bijection between them that preserves $+$ and $\cdot$). We can refer to
a finite field of size $p^n$ abstractly as $GF(p^n)$, which is an
abbreviation for the Galois field of order $p^n$.


^4 This is not an accident: any extension field acts like a vector space over
its base field.


14.5 Applications


So what are these things good for? 

On the one hand, given an irreducible polynomial $p(x)$ of degree $n$ over
$\mathbb{Z}_2$, it's easy to implement arithmetic in $\mathbb{Z}_2[x]/p(x)$
(and thus $GF(2^n)$) using standard-issue binary integers. The trick is to
represent each polynomial $\sum a_i x^i$ by the integer value
$a = \sum a_i 2^i$, so that each coefficient $a_i$ is just the $i$-th bit of
$a$. Adding two polynomials $a + b$ represented in this way corresponds to
computing the bitwise exclusive or of $a$ and $b$: a^b in programming
languages that inherit their arithmetic syntax from C (i.e., almost everything
except Scheme). Multiplying polynomials is more involved, although it's
easy for some special cases like multiplying by $x$, which becomes a
left-shift (a<<1) followed by XORing with the representation of our modulus
if we get a 1 in the $n$-th place. (The general case is like this but
involves doing XORs of a lot of left-shifted values, depending on the bits in
the polynomial we are multiplying by.)
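
As a rough sketch of the general case, here is a shift-and-XOR multiplication in Python. The function name `gf2_mul` is invented for this example, and it assumes both inputs are already reduced (that is, they fit in $n$ bits, where $n$ is the degree of the modulus).

```python
def gf2_mul(a: int, b: int, modulus: int) -> int:
    """Multiply two elements of Z_2[x]/(modulus), each stored as an integer
    whose i-th bit is the coefficient of x^i."""
    n = modulus.bit_length() - 1   # degree of the modulus
    result = 0
    while b:
        if b & 1:                  # this power of x appears in b
            result ^= a            # so add (XOR) the matching shift of a
        b >>= 1
        a <<= 1                    # multiply a by x
        if (a >> n) & 1:           # a 1 showed up in the n-th place
            a ^= modulus           # reduce by XORing with the modulus
    return result


# Modulus x^3 + x + 1 is 0b1011; (x^2 + x + 1) * (x^2 + 1) in this field:
product = gf2_mul(0b111, 0b101, 0b1011)
print(bin(product))
```

With the modulus $x^3 + x + 1$ used for the 8-element table above, the product of $x^2 + x + 1$ and $x^2 + 1$ comes out as $x^2 + x$, in agreement with the corresponding table entry.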

On the other hand, knowing that we can multiply $7 = x^2 + x + 1$ by
$5 = x^2 + 1$ and get $6 = x^2 + x$ quickly using C bit operations doesn't
help us much if this product doesn't mean anything. For modular arithmetic
(§8.3), we at least have the consolation that $7 \cdot 5 = 6 \pmod{29}$ tells
us something about remainders. In $GF(2^3)$, what this means is much more
mysterious. This makes it useful, not in contexts where we want
multiplication to make sense, but in contexts where we don't. These mostly
come up in random number generation and cryptography.


14.5.1 Linear-feedback shift registers 


Let's suppose we generate $x^0, x^1, x^2, \dots$ in
$\mathbb{Z}_2[x]/(x^4 + x^3 + 1)$, which happens to be one of the finite
fields isomorphic to $GF(2^4)$. Since there are only $2^4 - 1 = 15$ nonzero
elements in $GF(2^4)$, we can predict that eventually this sequence will
repeat, and in fact we can show that $p^{15} = 1$ for any nonzero $p$ using
essentially the same argument as for Fermat's Little Theorem. So we will have
$x^0 = x^{15} = x^{30}$ etc. and thus will expect our sequence to repeat
every 15 steps (or possibly some factor of 15, if we are unlucky).

To compute the actual sequence, we could write out full polynomials:
$1, x, x^2, x^3, x^3 + 1, x^3 + x + 1, \dots$, but this gets tiresome fast.
So instead we'd like to exploit our representation of $\sum a_i x^i$ as
$\sum a_i 2^i$.

Now multiplying by $x$ is equivalent to shifting left (i.e. multiplying by 2)
followed by XORing with 11001, the binary representation of
$x^4 + x^3 + 1$, if we get a bit in the $x^4$ place that we need to get rid
of. For example, we might do:


1101 (initial value) 
11010 (after shift) 
0011 (after XOR with 11001) 


or 


0110 (initial value) 
01100 (after shift) 
1100 (no XOR needed) 


If we write our initial value as $r_3 r_2 r_1 r_0$, the shift produces a new
value $r_3 r_2 r_1 r_0 0$. Then XORing with 11001 has three effects: (a) it
removes a leading 1 if present; (b) it sets the rightmost bit to $r_3$; and
(c) it flips the new leftmost bit if $r_3 = 1$. Steps (a) and (b) turn the
shift into a rotation. Step (c) is the mysterious flip from our sequence
generator. So in fact what our magic sequence generator was doing was just
computing all the powers of $x$ in a particular finite field.

As in $\mathbb{Z}_p$, these powers of an element bounce around unpredictably,
which makes them a useful (though cryptographically very weak) pseudorandom
number generator. Because high-speed linear-feedback shift registers are
very cheap to implement in hardware, they are used in applications where a
pre-programmed, statistically smooth sequence of bits is needed, as in the
Global Positioning System and to scramble electrical signals in computers to
reduce radio-frequency interference.
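
The shift-and-XOR view can be checked with a short Python sketch that lists the powers of $x$ modulo $x^4 + x^3 + 1$. The names `MODULUS`, `times_x`, and `powers` are chosen for this illustration only.

```python
MODULUS = 0b11001  # binary representation of x^4 + x^3 + 1


def times_x(r: int) -> int:
    """Multiply a 4-bit field element by x: shift left, then reduce."""
    r <<= 1
    if r & 0b10000:      # a bit landed in the x^4 place
        r ^= MODULUS     # get rid of it by XORing with the modulus
    return r


# x^0, x^1, ..., x^15 as 4-bit integers
powers = [1]
for _ in range(15):
    powers.append(times_x(powers[-1]))

print([format(p, "04b") for p in powers])
```

The list begins 0001, 0010, 0100, 1000, 1001, 1011, which are the same states the rotate-and-flip generator passes through, and it returns to 0001 at $x^{15}$.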


14.5.2 Checksums 


Shifting an LFSR corresponds to multiplying by x. If we also add 1 from 
time to time, we can build any polynomial we like, and get the remainder 
mod m; for example, to compute the remainder of 100101 mod 11001 we do 


0000 (start with 0) 
00001 (shift in 1) 
0001 (no XOR) 
00010 (shift in 0) 
0010 (no XOR) 
00100 (shift in 0) 
0100 (no XOR) 
01001 (shift in 1) 
1001 (no XOR)
10010 (shift in 0)
1011 (XOR with 11001)
10111 (shift in 1)
1110 (XOR with 11001)


and we have computed that the remainder of $x^5 + x^2 + 1$ mod
$x^4 + x^3 + 1$ is $x^3 + x^2 + x$.

This is the basis for cyclic redundancy check (CRC) checksums, 
which are used to detect accidental corruption of data. The idea is that 
we feed our data stream into the LFSR as the coefficients of some gigantic 
polynomial, and the checksum summarizing the data is the state when we 
are done. Since it’s unlikely that a random sequence of flipped or otherwise 
damaged bits would equal 0 mod m, most non-malicious changes to the data 
will be visible by producing an incorrect checksum. 
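
The shift-in-and-reduce loop from the trace above can be sketched as follows. This is a toy illustration of the idea, not any standardized CRC; the function name `poly_remainder` and the string-of-bits input format are assumptions made for the example.

```python
def poly_remainder(bits: str, modulus: int) -> int:
    """Remainder of the polynomial whose coefficients are `bits`
    (highest degree first) modulo `modulus`, over Z_2."""
    n = modulus.bit_length() - 1
    state = 0
    for bit in bits:
        state = (state << 1) | int(bit)   # shift in the next coefficient
        if (state >> n) & 1:              # overflow into the x^n place
            state ^= modulus              # XOR with the modulus
    return state


checksum = poly_remainder("100101", 0b11001)
damaged = poly_remainder("100111", 0b11001)   # one flipped bit
print(format(checksum, "04b"), format(damaged, "04b"))
```

The first value reproduces the remainder 1110 from the trace; the second differs, which is how a receiver would notice the flipped bit.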


14.5.3 Cryptography 


$GF(2^n)$ can also substitute for $\mathbb{Z}_p$ in some cryptographic
protocols. An example would be the function $f(s) = x^s \pmod{m}$, which is
fairly easy to compute in $\mathbb{Z}_p$ and even easier to compute in
$GF(2^n)$, but which seems to be hard to invert in both cases. Here we can
take advantage of the fast remainder operation provided by LFSRs to avoid
having to do expensive division in $\mathbb{Z}$.


Appendix A 


Sample assignments from 
Fall 2017 


Assignments are typically due Wednesdays at 5:00 pm. Assignments should 
be uploaded to Canvas in PDF format. See Appendix G for some suggestions 
for how to format your solutions as PDF. 

Do not include any identifying information in your submissions. 
This will allow grading to be done anonymously. 

Make sure that your submissions are readable. You are strongly
advised to use LaTeX, Microsoft Word, Google Docs, or similar software to
generate typeset solutions. Scanned or photographed handwritten submissions
often come out badly, and submissions that are difficult for the graders
to read will be penalized.

Sample solutions will appear in this appendix after the assignment is due. 
To maintain anonymity of both submitters and graders, questions about 
grading should be submitted through Canvas. 


A.1 Assignment 1: Due Wednesday, 2017-09-13, 
at 5:00 pm 


Bureaucratic part 


Send me email! My address is james.aspnes@gmail.com. 
In your message, include: 


1. Your name.

2. Your status: whether you are an undergraduate, grad student, auditor,
   etc.

3. Anything else you'd like to say.

(You will not be graded on the bureaucratic part, but you should do it
anyway.)


A.1.1 A curious proposition


Consider the proposition

    $((P \to Q) \to P) \to Q$    (A.1.1)


1. Write out a truth table in the style of §2.2.2 to determine for which 
assignments of truth values to P and Q this proposition is true. 


2. Show how to convert (A.1.1) into conjunctive normal form using standard
   logical equivalences.


3. Show how to convert (A.1.1) into disjunctive normal form using standard
   logical equivalences.


4. Show that the proposition

       $P \to (Q \to (P \to Q))$    (A.1.2)

   is not logically equivalent to $((P \to Q) \to P) \to Q$.


Solution 

1.
       P  Q  |  P → Q  |  (P → Q) → P  |  ((P → Q) → P) → Q
       0  0  |    1    |       0       |          1
       0  1  |    1    |       0       |          1
       1  0  |    0    |       1       |          0
       1  1  |    1    |       1       |          1

2.
   $((P \to Q) \to P) \to Q \equiv \lnot(\lnot(\lnot P \lor Q) \lor P) \lor Q$
   $\equiv ((\lnot P \lor Q) \land \lnot P) \lor Q$
   $\equiv (\lnot P \lor Q \lor Q) \land (\lnot P \lor Q)$
   $\equiv (\lnot P \lor Q) \land (\lnot P \lor Q)$
   $\equiv \lnot P \lor Q$

   This is trivially in conjunctive normal form: it's an AND of exactly one
   OR clause. (We could also have stopped a few steps earlier: nothing
   says that CNF can't include duplicate clauses or variables.)

   We could further simplify this expression to $P \to Q$, but then it
   wouldn't be in CNF.


3. In the previous step, we reduced (A.1.1) to $\lnot P \lor Q$, and observed
   that this expression is in conjunctive normal form. But it is also in
   disjunctive normal form, since it is an OR of two (trivial) AND-clauses.


4. Here are two ways to do this:

   • Using a truth table:

       P  Q  |  P → Q  |  Q → (P → Q)  |  P → (Q → (P → Q))
       0  0  |    1    |       1       |          1
       0  1  |    1    |       1       |          1
       1  0  |    0    |       1       |          1
       1  1  |    1    |       1       |          1

     We can now observe that the rightmost column doesn't match the
     rightmost column in the truth table for $((P \to Q) \to P) \to Q$.

   • Using logical equivalences:

     $P \to (Q \to (P \to Q)) \equiv \lnot P \lor (\lnot Q \lor (\lnot P \lor Q))$
     $\equiv (\lnot P \lor \lnot P) \lor (\lnot Q \lor Q)$
     $\equiv \lnot P \lor 1$
     $\equiv 1$.

     Since we previously established that
     $((P \to Q) \to P) \to Q \equiv \lnot P \lor Q \not\equiv 1$, this shows
     $((P \to Q) \to P) \to Q \not\equiv P \to (Q \to (P \to Q))$.
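
Both truth tables are small enough to check by brute force. The following Python sketch enumerates all four assignments; the helper names `implies`, `first`, and `second` are invented for this check.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication a -> b."""
    return (not a) or b


def first(p: bool, q: bool) -> bool:
    """((P -> Q) -> P) -> Q, proposition (A.1.1)."""
    return implies(implies(implies(p, q), p), q)


def second(p: bool, q: bool) -> bool:
    """P -> (Q -> (P -> Q)), proposition (A.1.2)."""
    return implies(p, implies(q, implies(p, q)))


rows = [(p, q, first(p, q), second(p, q))
        for p, q in product([False, True], repeat=2)]
for p, q, a, b in rows:
    print(int(p), int(q), int(a), int(b))
```

The third column should agree with $\lnot P \lor Q$ and the fourth should be all ones, so the two propositions differ exactly at $P = 1$, $Q = 0$.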


A.1.2 Relations 


For this problem, you are given a predicate $Pxy$ that holds if $x$ is a
parent of $y$, and need to define other family relationships using the tools
of first-order predicate logic ($\lnot$, $\land$, $\lor$, $\to$, $\exists$,
$\forall$, $\exists!$, etc.). For example, we could define $Gxy$, meaning
that $x$ is a grandparent of $y$, using the axiom

    $Gxy \leftrightarrow (\exists z : Pxz \land Pzy)$.

For each of the predicates below, give a definition of the predicate based
on $P$, in the form of an axiom that specifies when the predicate is true.

1. Let $Hxy$ hold if $x$ and $y$ are half siblings, which means that $x$ and
   $y$ have exactly one common parent.

2. Let $Sxy$ hold if $x$ and $y$ are full siblings, which means that $x$ and
   $y$ have at least two parents in common.

3. Suppose our society practices cousin marriage, where a marriage between
   $x$ and $y$ is considered desirable if $x$ and $y$ have exactly one
   grandparent in common. Let $Mxy$ hold if this is the case.


4. Let Axy hold if x is an ancestor of y. 


Solution 


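1. $Hxy \leftrightarrow (\exists! z : Pzx \land Pzy)$. If we want to avoid
   using $\exists!$, we can expand this as
   $Hxy \leftrightarrow (\exists z : Pzx \land Pzy \land
   (\forall q : Pqx \land Pqy \to q = z))$.

2. $Sxy \leftrightarrow (\exists z : \exists q : z \ne q \land Pzx \land Pzy
   \land Pqx \land Pqy)$.

3. $Mxy \leftrightarrow (\exists! g : \exists p : \exists q : Pgp \land Ppx
   \land Pgq \land Pqy)$. This can also be expanded to use $\exists$ instead
   of $\exists!$.

4. This one is done using recursion:
   $Axy \leftrightarrow Pxy \lor (\exists z : Axz \land Azy)$.
   Replacing at most one of $Axz$ or $Azy$ with $Pxz$ or $Pzy$ (respectively)
   also works.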


A.1.3 A theory of shirts


The following set of axioms attempts to describe the rules for shirt sizing.
The predicates $Sx$, $Mx$, and $Lx$ say that $x$ is small, medium, or large,
respectively, and the predicate $Bxy$ says that $x$ is bigger than $y$.

    $S_1$    : $\exists x\, Sx$
    $M_1$    : $\exists x\, Mx$
    $L_1$    : $\exists x\, Lx$
    $B_{LM}$ : $\forall x \forall y (Lx \land My) \to Bxy$
    $B_{MS}$ : $\forall x \forall y (Mx \land Sy) \to Bxy$
    $T$      : $\forall x \forall y \forall z (Bxy \land Byz \to Bxz)$
    $I$      : $\forall x\, \lnot Bxx$

A very small model for these axioms consists of three shirts $s$, $m$, and
$\ell$, with $Ss$, $Mm$, $L\ell$, $B\ell m$, $Bms$, and $B\ell s$ being an
exclusive list of true predicate assignments.^1 This can be verified
(tediously) by checking that each of the axioms holds. For example, $T$ works
because the only way to assign $x$, $y$, and $z$ so that $Bxy$ and $Byz$ are
both true is to make $x = \ell$, $y = m$, and $z = s$; but then $Bxz$ is
$B\ell s$, which is true.

For each of the following statements, prove (using the methods in §§2.4
and 2.5) that it is a consequence of the above axioms, or describe a model in
which the axioms hold but the statement does not.


1. $\forall x (Sx \lor Mx \lor Lx)$.

2. $\forall x \forall y (Lx \land Sy) \to Bxy$.

3. $\forall x (\lnot Sx \lor \lnot Mx)$.

4. $\forall x \exists y (\lnot Lx \to Byx)$.


Solution 


1. This is not true in general. Consider a model that adds to the very
   small model an extra shirt $q$, such that no predicate involving $q$ is
   true. Axioms $S_1$, $M_1$, and $L_1$ are still true, because $s$, $m$, and
   $\ell$ make them true. The remaining axioms also hold because they
   continue to hold for $s$, $m$, and $\ell$; setting any of the variables to
   $q$ makes the premise of the implication false in $B_{LM}$, $B_{MS}$, or
   $T$; and setting $x$ to $q$ makes $Bxx$ false and thus $\lnot Bxx$ true in
   $I$. But in this model it is not the case that
   $\forall x : (Sx \lor Mx \lor Lx)$, because $Sx \lor Mx \lor Lx$ is false
   when $x = q$.


2. Proof: Fix $x$ and $y$ and suppose $Lx$ and $Sy$ both hold. Let $m$ be any
   shirt for which $Mm$ is true (at least one such shirt exists by Axiom
   $M_1$). Then $Bxm$ (Axiom $B_{LM}$) and $Bmy$ (Axiom $B_{MS}$). So $Bxy$
   (Axiom $T$).


^1 Note that there is nothing special about the names $s$, $m$, and $\ell$,
which were chosen mostly to make it easier to remember which shirt satisfies
which predicate. We could instead have made a model with, say, shirts named
$a$, $b$, $c$, and $d$, satisfying precisely the predicates $La$, $Mb$, $Mc$,
$Sd$, $Bab$, $Bac$, $Bad$, $Bbc$, $Bbd$, and $Bcd$. This model has two medium
shirts, one of which ($b$) is bigger than the other one ($c$). It satisfies
the axioms because $S_1$ holds for $x = d$; $M_1$ holds for $x = b$ or
$x = c$; $L_1$ holds for $x = a$; $B_{LM}$ holds for the cases $x = a$ and
$y = b$ or $x = a$ and $y = c$; $B_{MS}$ holds for the cases $x = b$ and
$y = d$ or $x = c$ and $y = d$; $T$ holds for all four possible choices of
$x$, $y$, and $z$ that make $Bxy$ and $Byz$ true; and $I$ holds because we
were not foolish enough to set any of $Baa$, $Bbb$, $Bcc$, or $Bdd$ to be
true.




3. Proof: Suppose there is a shirt $x$ with $Sx$ and $Mx$. Then $Bxx$ (Axiom
   $B_{MS}$). But this contradicts Axiom $I$.

   If we are uncomfortable with a proof by contradiction, we can turn
   the argument around using contraposition. The direct proof is: Fix $x$.
   Then $\lnot Bxx$ (Axiom $I$). Applying contraposition to the implication
   in Axiom $B_{MS}$ gives
   $\forall x \forall y (\lnot Bxy \to \lnot(Mx \land Sy))$. We can further
   rewrite this using De Morgan's law to get
   $\forall x \forall y (\lnot Bxy \to (\lnot Mx \lor \lnot Sy))$.
   Specialize $y$ to $x$ to get
   $\forall x (\lnot Bxx \to (\lnot Mx \lor \lnot Sx))$. But we previously
   established $\forall x\, \lnot Bxx$, so this gives
   $\forall x (\lnot Mx \lor \lnot Sx)$.


4. For this we can reuse the four-shirt model from the first case. Letting
   $x = q$ makes $\lnot Lx$ true (since $Lq$ is false), but it is also the
   case that for any choice of $y$, $Byq$ is false. So we have
   $\forall y\, \lnot(\lnot Lq \to Byq) \equiv
   \lnot \exists y (\lnot Lq \to Byq)$, giving a counterexample to the
   statement.


A.2 Assignment 2: Due Wednesday, 2017-09-20, 
at 5:00 pm 


A.2.1 Arithmetic, or is it? 


Suppose we have the following axioms, where 0 and 1 are constants and
$+$ is a function of two arguments written in the usual infix notation. As
usual, we adopt the convention that any unbound variables are universally
quantified: so, for example, the axiom $x + y = y + x$ should be read as
$\forall x \forall y : x + y = y + x$.

    $0 \ne 1$                                  (A.2.1)
    $x + 0 = x$                                (A.2.2)
    $x + y = y + x$                            (A.2.3)
    $x + (y + z) = (x + y) + z$                (A.2.4)
    $(x + y = 0) \to (x = 0 \land y = 0)$      (A.2.5)

Define $x < y$ to hold if and only if there exists some $z \ne 0$ such that
$x + z = y$.

For each of the following statements, give a proof that it follows from
the above axioms, or construct a model in which the axioms hold but the
statement is false.

1. $0 < 1$.

2. If $x + z = y + z$, then $x = y$.

3. If $x < y$, then $x + z < y + z$.

4. If $a < b$ and $c < d$, then $a + c < b + d$.


Solution 


1. Proof: From (A.2.3) and (A.2.2) we have $0 + 1 = 1 + 0 = 1$. Now apply
   the definition of $<$ with $x = 0$, $z = 1$, $y = 1$, noting that
   $z = 1 \ne 0$ by (A.2.1).


2. Counterexample model: Include only 0 and 1, with $0 \ne 1$ (so (A.2.1)
   holds). Let $x + y = 1$ if $x = 1$ or $y = 1$, and let $x + y = 0$
   otherwise. Then (A.2.2) holds because $x + 0$ is either 0 if $x$ is 0 or 1
   if $x$ is 1; (A.2.3) holds because OR is commutative; (A.2.4) holds
   because OR is associative; and (A.2.5) is immediate from our definition
   of $+$.

   However, in this model we have $0 + 1 = 1 + 1$, but $0 \ne 1$.


3. Let $x < y$. Expanding the definition gives that there exists some
   $q \ne 0$ such that $x + q = y$. But then for any $z$,
   $(x + q) + z = y + z$ (substitution rule), and applying (A.2.4) and
   (A.2.3) a few times gives $(x + z) + q = y + z$. Since $q \ne 0$, this
   shows $x + z < y + z$.


4. Let $a < b$ and $c < d$. Then there exist $q$, $r$, both nonzero, such
   that $a + q = b$ and $c + r = d$. Use substitution to show
   $(a + q) + (c + r) = b + d$, and use (A.2.4) and (A.2.3) to rewrite the
   left-hand side to get $(a + c) + (q + r) = b + d$. Because $q \ne 0$,
   (A.2.5) says $q + r \ne 0$, which gives $a + c < b + d$.
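
The two-element counterexample model from part 2 can be checked mechanically. The Python sketch below treats $+$ as bitwise OR on $\{0, 1\}$, tests axioms (A.2.2) through (A.2.5) exhaustively, and lists the triples where cancellation fails; the names `D`, `add`, `axioms_hold`, and `cancellation_fails` are made up for this check.

```python
from itertools import product

D = (0, 1)                     # the model's universe; 0 != 1 gives (A.2.1)


def add(x: int, y: int) -> int:
    """The model's '+': 1 if either argument is 1, else 0."""
    return x | y


axioms_hold = (
    all(add(x, 0) == x for x in D)                                # (A.2.2)
    and all(add(x, y) == add(y, x)
            for x, y in product(D, repeat=2))                     # (A.2.3)
    and all(add(x, add(y, z)) == add(add(x, y), z)
            for x, y, z in product(D, repeat=3))                  # (A.2.4)
    and all(add(x, y) != 0 or (x == 0 and y == 0)
            for x, y in product(D, repeat=2))                     # (A.2.5)
)

# Triples (x, y, z) with x + z = y + z but x != y.
cancellation_fails = [(x, y, z) for x, y, z in product(D, repeat=3)
                      if add(x, z) == add(y, z) and x != y]

print(axioms_hold, cancellation_fails)
```

The triple $(0, 1, 1)$ in the output is the instance $0 + 1 = 1 + 1$ used in the solution.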


A.2.2 Some distributive laws

Prove or disprove each of the following:

1. For all sets $A$, $B$, $C$, and $D$: if $A \subseteq C$ and
   $B \subseteq D$, then $A \cap B \subseteq C \cap D$.


2. For all sets $A$, $B$, $C$, and $D$: if $A \subseteq C$ and
   $B \supseteq D$, then $A \setminus B \subseteq C \setminus D$.


Solution


1. Proof: Let $x \in A \cap B$. Then $x \in A \subseteq C$ implies $x \in C$,
   and similarly $x \in B \subseteq D$ implies $x \in D$. So
   $x \in C \cap D$. Since $x$ was arbitrary, we have
   $\forall x : x \in A \cap B \to x \in C \cap D$, which is the definition
   of $A \cap B \subseteq C \cap D$.

2. Proof: Let $x \in A \setminus B$. Then $x \in A$ and $x \notin B$, which
   gives $x \in C$ (since $A \subseteq C$) and $x \notin D$ (since
   $D \subseteq B$). So $x \in C \setminus D$.


A.2.3 Elements and subsets


Suppose $A$, $B$, and $C$ are all sets.

In each of the following situations, show one of (a) $A$ must be an element
of $C$ but is not necessarily a subset of $C$; (b) $A$ must be a subset of
$C$ but is not necessarily an element of $C$; (c) $A$ must be both an element
and a subset of $C$; or (d) $A$ is not necessarily either an element or
subset of $C$.

1. $A \in B \in C$.

2. $A \in B \subseteq C$.

3. $A \subseteq B \in C$.

4. $A \subseteq B \subseteq C$.

(By convention, $A \in B \in C$ means $A \in B$ and $B \in C$,
$A \in B \subseteq C$ means $A \in B$ and $B \subseteq C$, and similarly for
the other cases.)


Solution


1. Let $A = \{\emptyset\}$, $B = \{A\}$, and $C = \{B\}$. Then $A \in B$ and
   $B \in C$. But $A \notin C$ since $A \ne B$, and $A \not\subseteq C$ since
   $A$'s element $\emptyset$ is not an element of $C$.


2. Because $B \subseteq C$, any element of $B$ is also an element of $C$. So
   $A \in C$. But $A$ need not be a subset of $C$; for example, let
   $A = \{\emptyset\}$, $B = C = \{A\}$.


3. Let $A = \{\emptyset\}$, $B = \{\emptyset, \{\{\emptyset\}\}\}$,
   $C = \{B\}$. Then $A \subseteq B \in C$ but $A \notin C$ and
   $A \not\subseteq C$.


4. If $x \in A$, then $x \in B$ (since $A \subseteq B$), but then also
   $x \in C$ (since $B \subseteq C$). So any $x$ in $A$ is also in $C$,
   making $A$ a subset of $C$. But $A$ need not be an element of $C$; for
   example, let $A = B = C = \emptyset$.


A.3 Assignment 3: Due Wednesday, 2017-09-27,
at 5:00 pm


A.3.1 A powerful problem 


Recall that if $A$ and $B$ are sets, then $A^B$ is the set of all functions
$f : B \to A$. Let $1 = \{\emptyset\}$, our usual representative one-element
set. Show that if $|1^A| = |A^1|$, then $|A| = 1$.


Solution


There is exactly one function $f : A \to 1$ (it sends all elements of $A$ to
the unique element of 1) so $|1^A| = 1$.

We also have $|A^1| = |A|$, because the function $g : A^1 \to A$ defined by
$g(f) = f(\emptyset)$ is a bijection. To show this, observe first that $g$ is
injective, since if $g(f) = g(f')$ we have $f(\emptyset) = f'(\emptyset)$,
which implies $f = f'$ since $\emptyset$ is the only element of the domain of
$f$ and $f'$. Then observe that $g$ is surjective, since for any $x$ in $A$,
there is a function $\emptyset \mapsto x$ in $A^1$ such that
$g(\emptyset \mapsto x) = (\emptyset \mapsto x)(\emptyset) = x$.

Combining these facts and the assumption $|1^A| = |A^1|$ gives
$|A| = |A^1| = |1^A| = 1$.


A.3.2 A correspondence 


Prove or disprove: For any sets $A$, $B$, and $C$, there exists a bijective
function $f : C^{A \times B} \to (C^B)^A$.


Solution


Proof: For any function $g : A \times B \to C$ in $C^{A \times B}$, define
$f(g) : A \to C^B$ by the rule $f(g)(a)(b) = g(a, b)$.

To show $f$ is injective, let $f(g) = f(g')$. Then for any $a$ in $A$ and
$b$ in $B$, $g(a, b) = f(g)(a)(b) = f(g')(a)(b) = g'(a, b)$, giving
$g = g'$.

To show $f$ is surjective, let $h : A \to C^B$. Define
$f'(h) : A \times B \to C$ by the rule $f'(h)(a, b) = h(a)(b)$. Then for all
$a \in A$, $b \in B$, $f(f'(h))$ satisfies
$f(f'(h))(a)(b) = f'(h)(a, b) = h(a)(b)$, which gives $f(f'(h)) = h$. Since
$h$ was arbitrary, there is an $f'(h)$ that covers every $h$ in $(C^B)^A$.

Since $f$ is both injective and surjective, it is bijective.
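
The maps $f$ and $f'$ from the proof translate almost symbol for symbol into code. In this Python sketch the names `curry` and `uncurry` are chosen here to play the roles of $f$ and $f'$; the sample function `g` is an arbitrary example, not something from the assignment.

```python
def curry(g):
    """Play the role of f: turn g(a, b) into a function a -> (b -> g(a, b))."""
    return lambda a: lambda b: g(a, b)


def uncurry(h):
    """Play the role of f': turn h(a)(b) back into a two-argument function."""
    return lambda a, b: h(a)(b)


def g(a: int, b: int) -> int:
    """A sample two-argument function."""
    return 10 * a + b


h = curry(g)
print(h(3)(4), uncurry(h)(3, 4), g(3, 4))
```

All three printed values agree, reflecting $f(g)(a)(b) = g(a, b)$ and $f'(f(g)) = g$ at the sampled point.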

(This particular bijection is known as currying and is popular in functional
programming.)

A.3.3 Inverses


For each set $A$, the identity function $1_A : A \to A$ is defined by
$1_A(x) = x$ for all $x$ in $A$.

Let $f : A \to B$ and $g : B \to A$ be functions such that
$g \circ f = 1_A$. Show that $f$ is injective and $g$ is surjective.


Solution


First, let's show $f$ is injective. Let $x$, $y$ be elements of $A$ such that
$f(x) = f(y)$. Then $x = 1_A(x) = g(f(x)) = g(f(y)) = 1_A(y) = y$.

Next, let's show that $g$ is surjective. Let $x$ be any element of $A$. Then
$f(x)$ is an element of $B$ such that $g(f(x)) = 1_A(x) = x$.


A.4 Assignment 4: Due Wednesday, 2017-10-04, 
at 5:00 pm 

A.4.1 Covering a set with itself 

Prove or disprove: For any set $A$, and any surjective function
$f : A \to A$, $f$ is bijective.

Solution


Disproof: Consider the set $\mathbb{N}$ (any infinite set will work, but
$\mathbb{N}$ has conveniently-labeled elements). Define a function
$f : \mathbb{N} \to \mathbb{N}$ by the rule

    $f(x) = \begin{cases} x - 1 & \text{if } x \ne 0, \\
                          0     & \text{if } x = 0. \end{cases}$

Then $f$ is surjective, since for any $y \in \mathbb{N}$, $y = f(y + 1)$, but
$f$ is not injective, because $f(0) = f(1) = 0$.

A.4.2 More inverses


Let $A$ be a set. Suppose that every function $f : A \to A$ has an inverse
function $f^{-1}$. How many elements can $A$ have?

Solution


Either 0 or 1. If $A$ has 0 elements, then the empty function is its own
inverse. If $A$ has 1 element $x$, then there is exactly one function in
$A^A$, which maps $x$ to $x$; this is also its own inverse.

To show that these are the only possibilities, suppose $A$ has at least two
elements $x$ and $y$. Let $f$ be the function that sends both $x$ and $y$ to
$x$, and sends all other elements $z$ to themselves. This is not injective
and so does not have an inverse.


A.4.3 Rational and irrational 


Let $q < r$ be real numbers such that $q$ is rational and $r$ is irrational.
Show, using the axioms and results in Chapter 4, that there exists a rational
$q'$ such that $q < q' < r$.


Solution


We'll use Theorem 4.3.2, plus the fact that $x + y$ is rational whenever $x$
and $y$ are both rational. (Proof: If $x = a/b$ and $y = c/d$, then
$x + y = \frac{ad + bc}{bd}$.)

Because $q < r$, we have $0 = q - q < r - q$.

If $r - q \ge 2$, then $1 < 2 \le r - q$ implies $q + 1 < r$. In this case we
can just set $q' = q + 1$.

If $r - q < 2$, then we have $0 < r - q < 2$ and so Theorem 4.3.2 says
that there exists $n \in \mathbb{N}$ such that $n \cdot (r - q) > 2$. This
$n$ can't be zero, so it has a multiplicative inverse and we can multiply
both sides by $n^{-1}$ to get $(r - q) > 2/n$. But then we can set
$q' = q + 2/n < q + (r - q) = r$.

In both cases $q'$ is a sum of two rationals, hence rational, and $q < q'$
because we added a positive quantity to $q$.


A.5 Assignment 5: Due Wednesday, 2017-10-11,
at 5:00 pm


A.5.1 A recursive sequence


Consider the sequence $a_0, a_1, a_2, \dots$ given by the rule $a_0 = 1$, $a_1 = 2$,
$a_2 = 3$, and for $n > 2$, $a_n = a_{n-3} + a_{n-2} + a_{n-1}$. This sequence starts as

\[
1, 2, 3, 6, 11, 20, 37, 68, \dots
\]

Show that $a_n \le 2^n$ for all $n \in \mathbb{N}$.


Solution


The induction hypothesis is $a_n \le 2^n$. This holds for $a_0 = 1 \le 2^0$, $a_1 = 2 \le 2^1$,
and $a_2 = 3 \le 2^2$; these serve as base cases. For $n > 2$, suppose that the
hypothesis holds for all $k < n$; then
$a_n = a_{n-3} + a_{n-2} + a_{n-1} \le 2^{n-3} + 2^{n-2} + 2^{n-1} = 2^n (1/8 + 1/4 + 1/2) = 2^n (7/8) < 2^n$.
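
A quick numerical check can catch a slip in the base cases before we commit to the induction. The sketch below is only a sanity check, not a proof; the function name `sequence_terms` is our own and not part of the assignment.

```python
def sequence_terms(count):
    """Return a_0, ..., a_{count-1} for a_0=1, a_1=2, a_2=3, a_n = a_{n-3}+a_{n-2}+a_{n-1}."""
    terms = [1, 2, 3]
    while len(terms) < count:
        terms.append(terms[-3] + terms[-2] + terms[-1])
    return terms[:count]


terms = sequence_terms(60)
print(terms[:8])  # [1, 2, 3, 6, 11, 20, 37, 68]
# The claimed bound a_n <= 2^n, checked term by term.
assert all(a <= 2 ** n for n, a in enumerate(terms))
```

The ratio of consecutive terms settles below 2, which is why the factor 7/8 in the induction step leaves room to spare.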


A.5.2 Comparing products 


1. Let $a_1, \dots, a_n$ and $b_1, \dots, b_n$ be sequences such that $0 \le a_i \le b_i$ for all
$i \in \{1, \dots, n\}$. Prove that $\prod_{i=1}^{n} a_i \le \prod_{i=1}^{n} b_i$.


2. Recall that $n! = \prod_{i=1}^{n} i$. Show that, for any positive integer k, there
exists $n_k$ such that for all natural numbers $n \ge n_k$, $k^n \le n!$.


Solution 


1. The proof is by induction on n. 


We will use the stronger induction hypothesis that
$0 \le \prod_{i=1}^{n} a_i \le \prod_{i=1}^{n} b_i$, to save having to argue later that these
quantities are both non-negative.


For $n = 0$, the claim holds trivially: both products are empty and thus
equal to 1.


For larger n, we have $\prod_{i=1}^{n} a_i = a_1 \cdot \prod_{i=2}^{n} a_i$ and $\prod_{i=1}^{n} b_i = b_1 \cdot \prod_{i=2}^{n} b_i$,
and from the induction hypothesis, $0 \le \prod_{i=2}^{n} a_i \le \prod_{i=2}^{n} b_i$. Since we
also have $0 \le a_1 \le b_1$, this gives $0 \le a_1 \cdot \prod_{i=2}^{n} a_i \le b_1 \cdot \prod_{i=2}^{n} b_i$ and thus
$0 \le \prod_{i=1}^{n} a_i \le \prod_{i=1}^{n} b_i$.

(We are using here the fact that $0 \le a \le b$ and $0 \le c \le d$ implies
$0 \le ac \le bd$. This is not given directly by our axioms for the reals,
but is easily shown: $ac \le bc \le bd$ using Axiom 4.2.5, and similarly
$0 = a \cdot 0 \le ac$.)


2. Let $n_k = 2k^2$ (other choices may also work, but this one makes the
proof easier).


A.5.3 Rubble removal


One morning, you wake up on a deserted island with nothing but a lifetime 
supply of food and water, access to the Internet, and several piles of rocks 
on the beach. While waiting for rescue, you decide to get rid of the rocks. 
Each day you may either (a) split an existing pile containing at least two 
rocks into two nonempty piles, or (b) pick up a rock from a one-rock pile and 
throw it into the ocean. After taking one or the other of these actions, you 
go back to the Internet café and continue working on your 202 homework 
until the next day comes around. 

For example, if you start with a 3-rock pile, on day 1 you can split it into
a 1-rock pile and a 2-rock pile, on day 2 you can throw away the 1-rock pile,
on day 3 you can split the 2-rock pile into two 1-rock piles, and on days 4 and 5
you can throw away the 1-rock piles. This strategy removes a 3-rock pile in
just 5 days.

If you start with k piles of sizes $n_1, n_2, \dots, n_k$, where each $n_i > 0$, and
on each day you take exactly one allowed action, what is the minimum and
maximum number of days it will take to get rid of all of the rocks? Give a
proof that your answer is correct.


Solution


The minimum and maximum are both $t = \sum_{i=1}^{k} (2n_i - 1)$ regardless of
strategy. We can prove this by induction on t.

When $t = 0$, there are no piles, and it takes no time to remove them.

For $t > 0$, there is at least one pile. Suppose we split or remove pile i.

If we remove pile i, where $n_i = 1$, then we have a new sequence of $k - 1$
piles $n_1, \dots, n_{i-1}, n_{i+1}, \dots, n_k$. Compute
$t' = \sum_{j \ne i} (2n_j - 1) = t - (2n_i - 1) = t - 1$. Because $t'$ is less than t, the
induction hypothesis tells us that it will take $t' = t - 1$ days to remove the
remaining piles. This gives a total of $1 + t' = 1 + (t - 1) = t$ days.

If we split pile i, where $n_i > 1$, then we get two new piles of size $n_a$ and
$n_b$, where $n_a$ and $n_b$ are both at least 1 and $n_a + n_b = n_i$. So now we have
$t' = t - (2n_i - 1) + (2n_a - 1) + (2n_b - 1) = t - 2n_i + 1 + 2n_i - 2 = t - 1$. So
again we get $t' = t - 1 < t$, and the induction hypothesis tells us that it will take
$t - 1$ days to remove the remaining piles, for t days total.
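
Because the claim is that strategy does not matter, it is easy to test by simulation. The following sketch plays random legal moves and compares the number of days with the predicted total; the names `days_to_clear` and `predicted_days`, the seed, and the starting piles are our own choices for illustration.

```python
import random


def days_to_clear(piles, rng):
    """Play random legal moves until no piles remain; return the number of days."""
    piles = list(piles)
    days = 0
    while piles:
        size = piles.pop(rng.randrange(len(piles)))
        if size > 1:
            # Action (a): split into two nonempty piles at a random cut.
            cut = rng.randint(1, size - 1)
            piles.extend([cut, size - cut])
        # Otherwise action (b): the one-rock pile goes into the ocean.
        days += 1
    return days


def predicted_days(piles):
    """The value t = sum of (2 n_i - 1) from the solution."""
    return sum(2 * n - 1 for n in piles)


rng = random.Random(202)  # fixed seed so runs are repeatable
start = [3, 1, 5]
observed = {days_to_clear(start, rng) for _ in range(1000)}
print(observed, predicted_days(start))
```

Every run should report the same single value, matching the fact that the minimum and maximum coincide.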


A.6 Assignment 6: Due Wednesday, 2017-10-25, 
at 5:00 pm 
A.6.1 An oscillating sum 


For any $n \in \mathbb{N}$, let

\[
f(n) = (-1)^n \sum_{k=0}^{n} (-1)^k \cdot k. \tag{A.6.1}
\]

Give a closed-form expression for $f(n)$, and prove that it is correct.

Solution 


First let's figure out what $f(n)$ looks like, then try to prove that it works.
We can make a table:

    n    $(-1)^n n$    $\sum_{k=0}^{n} (-1)^k k$    $f(n)$
    0        0                  0                     0
    1       -1                 -1                     1
    2        2                  1                     1
    3       -3                 -2                     2
    4        4                  2                     2
    5       -5                 -3                     3
    6        6                  3                     3
    7       -7                 -4                     4


This suggests a sequence 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, ....
We can write this in closed form as

\[
g(n) = \left\lfloor \frac{n+1}{2} \right\rfloor. \tag{A.6.2}
\]

To prove that this works, it may be helpful to expand it out a bit:

\[
g(n) = \begin{cases} \frac{n}{2} & \text{when $n$ is even,} \\ \frac{n+1}{2} & \text{when $n$ is odd.} \end{cases}
\]

We will now argue by induction on n that $f(n) = g(n)$.
When $n = 0$, we have $f(n) = (-1)^0 \cdot 0 = 0$ and $g(n) = 0$.
For larger n, expand

\begin{align*}
f(n) &= (-1)^n \sum_{k=0}^{n} (-1)^k \cdot k \\
     &= (-1)^n \left( \sum_{k=0}^{n-1} (-1)^k \cdot k + (-1)^n \cdot n \right) \\
     &= (-1) \cdot f(n-1) + n \\
     &= n - g(n-1),
\end{align*}

where the last equality follows from the induction hypothesis.

If n is even, then $n - 1$ is odd, so
$f(n) = n - g(n-1) = n - \frac{(n-1)+1}{2} = n - \frac{n}{2} = \frac{n}{2} = g(n)$.

If n is odd, then $n - 1$ is even, so
$f(n) = n - g(n-1) = n - \frac{n-1}{2} = \frac{n+1}{2} = g(n)$.

In either case, the induction step goes through, and we have $f(n) =
g(n) = \lfloor (n+1)/2 \rfloor$ for all $n \in \mathbb{N}$.
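
The table is easy to regenerate by machine, which is a reasonable way to guess a closed form before proving it. This sketch computes $f(n)$ directly from (A.6.1) and compares it with the conjectured $g(n)$; it checks only finitely many cases, so the induction above is still needed.

```python
def f(n):
    """f(n) = (-1)^n * sum_{k=0}^{n} (-1)^k * k, computed directly from (A.6.1)."""
    return (-1) ** n * sum((-1) ** k * k for k in range(n + 1))


def g(n):
    """The conjectured closed form floor((n + 1) / 2)."""
    return (n + 1) // 2


print([f(n) for n in range(8)])  # [0, 1, 1, 2, 2, 3, 3, 4]
assert all(f(n) == g(n) for n in range(500))
```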
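

A.6.2 An approximate sum


Show that

\[
\sum_{k=1}^{n} k^2 \cdot 2^k = \Theta(n^2 \cdot 2^n). \tag{A.6.3}
\]


Solution

First observe that $\sum_{k=1}^{n} k^2 \cdot 2^k \ge n^2 \cdot 2^n = \Omega(n^2 \cdot 2^n)$.
For the upper bound,

\begin{align*}
\sum_{k=1}^{n} k^2 \cdot 2^k &\le \sum_{k=1}^{n} n^2 \cdot 2^k \\
  &= n^2 \sum_{k=1}^{n} 2^k \\
  &\le n^2 \sum_{k=0}^{n} 2^k \\
  &= n^2 \cdot \frac{2^{n+1} - 1}{2 - 1} \\
  &\le 2 n^2 \cdot 2^n \\
  &= O(n^2 \cdot 2^n).
\end{align*}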


A.6.3 A stretched function 


Let $f : \mathbb{N} \to \mathbb{N}$ and $g : \mathbb{N} \to \mathbb{N}$.
Prove or disprove: If f(n) is in O(n), and g(n) is in O(n), then f(g(n)) 
is in O(n). 


Solution 


This is true, but the proof is a little trickier than one might expect, since 
we may have to consider some special cases depending on what input g(n) 
supplies to f(n). 

Suppose $f(n)$ and $g(n)$ are both in $O(n)$. Let $c_f$, $n_f$, $c_g$, and $n_g$ be
constants such that $f(n) \le c_f n$ for all $n \ge n_f$ and $g(n) \le c_g n$ for all
$n \ge n_g$.²

Now pick some $n \ge n_g$, and consider $f(g(n))$. We have that $g(n) \le c_g n$.
For $f(g(n))$, there are two cases:


1. If $g(n) < n_f$, then the bound on f tells us nothing about $f(g(n))$. However,
there are only finitely many possible values less than $n_f$, so the set
$\{f(m) \mid m < n_f\}$ is finite, and so there is some upper bound b such that
$f(m) \le b$ for all $m < n_f$.


2. If $g(n) \ge n_f$, then $f(g(n)) \le c_f \cdot g(n) \le c_f c_g \cdot n$.


Let $c_{f \circ g} = c_f c_g$ and $n_{f \circ g} = \max(n_g, b / (c_f c_g))$. Then for any $n \ge n_{f \circ g}$, we
have $c_{f \circ g} \cdot n \ge c_{f \circ g} \cdot n_{f \circ g} \ge c_f c_g \cdot \frac{b}{c_f c_g} = b$. So if the first case above holds,
$f(g(n)) \le b \le c_{f \circ g} \cdot n$. If instead the second case holds, $f(g(n)) \le c_f c_g n =
c_{f \circ g} \cdot n$. In either case we have $f(g(n)) \le c_{f \circ g} n$ for $n \ge n_{f \circ g}$, which shows
$f(g(n))$ is in $O(n)$.


²We can drop the absolute values here because we know that $f(n)$ and $g(n)$ are always
non-negative.


A.7 Assignment 7: Due Wednesday, 2017-11-01, 
at 5:00 pm 


A.7.1  Divisibility 
Show that, for all $n \in \mathbb{N}$,

\[
12 \mid n(n+1)(n+2)(n+3). \tag{A.7.1}
\]


Solution 


The proof below is an improved version of my original draft solution, in which 
I got carried away and used the Chinese Remainder Theorem. Discussions 
with several students caused me to realize that using CRT was overkill.
The induction argument used below is adapted from a suggested proof by 
Alika Smith and replaces an uglier, though still valid, approach of finding a 
particular element of {n,n +1,...,n+k—1} that is divisible by k. There 
are many other ways to prove this result, but this is the one I like best. 

First let's show that for any $n \in \mathbb{N}$ and any $k \in \mathbb{N}^+$, $k \mid \prod_{i=n}^{n+k-1} i$.
The proof is by induction on n for fixed k.

When $n = 0$, the product is also 0, and $k \mid 0$.

Suppose now that $k \mid \prod_{i=n}^{n+k-1} i$. Expand

\begin{align*}
\prod_{i=n+1}^{n+k} i &= (n+k) \prod_{i=n+1}^{n+k-1} i \\
  &= \prod_{i=n}^{n+k-1} i + k \prod_{i=n+1}^{n+k-1} i.
\end{align*}

The first term is divisible by k by the induction hypothesis, and the second is a
multiple of k, so k divides their sum.

Setting $k = 4$ gives $4 \mid n(n+1)(n+2)(n+3)$. Setting $k = 3$ gives
$3 \mid n(n+1)(n+2) \mid n(n+1)(n+2)(n+3)$. But for any m, if $4 \mid m$ and $3 \mid m$,
then $\operatorname{lcm}(4, 3) = 12$ must also divide m. So $12 \mid n(n+1)(n+2)(n+3)$.
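
Both the general lemma and the special case are cheap to test numerically. The sketch below is a finite check under our own choice of ranges, with the hypothetical helper name `consecutive_product`; it complements the induction rather than replacing it.

```python
from math import prod


def consecutive_product(n, k):
    """Product of the k consecutive naturals n, n+1, ..., n+k-1."""
    return prod(range(n, n + k))


# The lemma: k divides any product of k consecutive naturals.
for k in range(1, 8):
    for n in range(0, 200):
        assert consecutive_product(n, k) % k == 0

# The special case asked for in the problem.
assert all(consecutive_product(n, 4) % 12 == 0 for n in range(1000))
```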


A.7.2 Squares


Let p be prime. Show that if

\[
x^2 \equiv y^2 \pmod{p}, \tag{A.7.2}
\]

then either

\[
x \equiv y \pmod{p} \tag{A.7.3}
\]

or

\[
x \equiv -y \pmod{p}. \tag{A.7.4}
\]


Solution


This is mostly just high-school algebra. Working in $\mathbb{Z}_p$, we start with

\[
x^2 = y^2.
\]

Subtract $y^2$ from both sides to get

\[
x^2 - y^2 = 0.
\]

Now factor the LHS to get

\[
(x + y)(x - y) \equiv 0 \pmod{p},
\]

which means that $p \mid (x + y)(x - y)$.

Recall from §8.4.2.2 that if $p \mid ab$ then $p \mid a$ or $p \mid b$. So either $p \mid (x + y)$,
giving $x + y \equiv 0 \pmod{p}$ and thus $x \equiv -y \pmod{p}$, or $p \mid (x - y)$, giving
$x - y \equiv 0 \pmod{p}$ and thus $x \equiv y \pmod{p}$.
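
The primality of p is doing real work here, which a brute-force search makes visible. The sketch below (with a helper name of our own) tests every pair of residues for a given modulus; it passes for small primes and fails for a composite modulus such as 8, where $1^2 \equiv 3^2$ but $3 \not\equiv \pm 1$.

```python
def square_roots_agree(p):
    """Check that x^2 = y^2 (mod p) forces x = y or x = -y (mod p)."""
    for x in range(p):
        for y in range(p):
            if (x * x - y * y) % p == 0:
                if (x - y) % p != 0 and (x + y) % p != 0:
                    return False
    return True


print([p for p in (2, 3, 5, 7, 11, 13) if square_roots_agree(p)])
print(square_roots_agree(8))  # composite modulus: 1^2 = 3^2 (mod 8)
```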


A.7.3 A Series of Unfortunate Exponents 


A C programmer working on a b-bit architecture decides to do a lot of
unsigned integer exponentiation. Starting with an initial value $x_0$, they
compute a sequence of values $x_0, x_1, x_2, \dots$ by the rule $x_{i+1} = x_i^k$, where k
is some odd exponent. Because they are foolish enough to program in C, the
actual rule is $x_{i+1} = x_i^k \bmod 2^b$, since C throws away without warning all
but the b least significant bits of the result when doing arithmetic, silently
putting all operations in $\mathbb{Z}_{2^b}$ instead of $\mathbb{N}$.

Show that if $x_0$ and k are both odd, then $x_{2^{b-2}} = x_0$.

(Added 2017-10-31: Assume $b > 2$. Also assume $0 < x_0 < 2^b$.)


APPENDIX A. SAMPLE ASSIGNMENTS FROM FALL 2017 310 


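Solution

An easy induction argument shows that $x_i = x_0^{k^i}$, working in $\mathbb{Z}_{2^b}$.

Recall that Euler's theorem says that if $\gcd(a, m) = 1$, then $a^{\phi(m)} \equiv 1
\pmod{m}$.

Since $x_0$ is odd, $\gcd(x_0, 2^b) = 1$, so $x_0^{\phi(2^b)} \equiv 1 \pmod{2^b}$. We can compute
$\phi(2^b)$ by the rule $\phi(p^n) = (p - 1) p^{n-1}$, which gives $\phi(2^b) = 1 \cdot 2^{b-1} = 2^{b-1}$.

Since k is odd, $\gcd(k, 2^{b-1}) = 1$, so by Euler's Theorem, $k^{\phi(2^{b-1})} =
k^{2^{b-2}} \equiv 1 \pmod{2^{b-1}}$.

Now consider $x_{2^{b-2}} = x_0^{k^{2^{b-2}}}$. Since $k^{2^{b-2}} \equiv 1 \pmod{2^{b-1}}$, we can rewrite
$k^{2^{b-2}}$ as $a 2^{b-1} + 1$, which makes

\[
x_{2^{b-2}} = x_0^{a 2^{b-1} + 1} = \left( x_0^{2^{b-1}} \right)^a \cdot x_0 \equiv 1^a \cdot x_0 = x_0 \pmod{2^b}.
\]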
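
Python integers do not wrap, so we can imitate the C behavior by reducing mod $2^b$ explicitly with the built-in three-argument `pow`. The sketch below checks the claim exhaustively for small word sizes; the function name `iterate` and the particular ranges of b and k are our own choices.

```python
def iterate(x0, k, b, steps):
    """Apply x -> x^k mod 2^b the given number of times, starting from x0."""
    modulus = 2 ** b
    x = x0
    for _ in range(steps):
        x = pow(x, k, modulus)
    return x


# Check x_{2^(b-2)} == x_0 for all odd x0 and a few odd exponents k.
for b in range(3, 9):
    for x0 in range(1, 2 ** b, 2):
        for k in (1, 3, 5, 7, 11):
            assert iterate(x0, k, b, 2 ** (b - 2)) == x0
```

Trying an even $x_0$ shows why the oddness assumption matters: the sequence collapses to 0 and never returns.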


A.8 Assignment 8: Due Wednesday, 2017-11-08, 
at 5:00 pm 


A.8.1 Minimal and maximal elements 


For this problem, we will consider partially-ordered sets whose elements are
sets of natural numbers, and for which the ordering is given by $\subseteq$. For each
such partially-ordered set, we can ask if it has a minimal or maximal element.

A very small example would be $\{\{0\}, \{0, 1\}, \{2\}\}$, which has minimal
elements $\{0\}$ and $\{2\}$ and maximal elements $\{0, 1\}$ and $\{2\}$.


1. Prove or disprove: There exists a nonempty $R \subseteq \mathcal{P}(\mathbb{N})$ with no maximal
elements.


2. Prove or disprove: There exists a nonempty $S \subseteq \mathcal{P}(\mathbb{N})$ with no minimal
elements.


3. Prove or disprove: There exists a nonempty $T \subseteq \mathcal{P}(\mathbb{N})$ that has neither
minimal nor maximal elements.


Solution


1. Proof: There are many choices here. One is to let $R = \{A_0, A_1, A_2, \dots\}$
where $A_i = \{j \in \mathbb{N} \mid j < i\}$. Then R has no maximal elements, because
for any $A_i \in R$, $A_i \subsetneq A_{i+1} \in R$.


2. Proof: For this we will do the same thing as above in reverse. Let
$S = \{B_0, B_1, B_2, \dots\}$ where $B_i = \{j \in \mathbb{N} \mid j > i\}$. Then S has no
minimal element, because for any $B_i \in S$, $B_i \supsetneq B_{i+1}$.


3. Proof: Here we can combine the previous two results by being a little
sneaky. Let $T = \{C_{ij} \mid i \in \mathbb{N}, j \in \mathbb{N}\}$ where each $x \in \mathbb{N}$ is in $C_{ij}$ if
and only if $x = 2k$ and $k < i$, or $x = 2k + 1$ and $k > j$. Now
T has no minimal or maximal elements, because for any $C_{ij} \in T$,
$C_{i,j+1} \subsetneq C_{ij} \subsetneq C_{i+1,j}$.


A.8.2 No trailing zeros


Let $\sim$ be a relation defined on $\mathbb{N}$ by the rule $x \sim y$ if $x = 2^k y$ or $y = 2^k x$ for
some $k \in \mathbb{N}$.


1. Show that $\sim$ is an equivalence relation.


2. Consider the set $\mathbb{N}/\!\sim$ of equivalence classes of $\sim$. Show that there is a
bijection $f : \mathbb{N} \to \mathbb{N}/\!\sim$.


Solution 


To make our life easier, let’s start with a quick lemma: 


Lemma A.8.1. For any $x, y \in \mathbb{N}$, $x \sim y$ if and only if there exists some
$k \in \mathbb{Z}$ such that $x = 2^k y$ in $\mathbb{Q}$.


Proof. Suppose $x \sim y$. Then either $x = 2^k y$ for some $k \in \mathbb{N} \subseteq \mathbb{Z}$ and we are
done, or $y = 2^{k'} x$ for some $k' \in \mathbb{N}$. In the latter case, solve for $x = 2^{-k'} y$
and let $k = -k'$.

In the other direction, if $x = 2^k y$, and $k \ge 0$, then $x = 2^k y$ for some
$k \in \mathbb{N}$, giving $x \sim y$. If instead $k < 0$, then $y = 2^{-k} x$, again giving $x \sim y$.


1. We must show that $\sim$ has all three conditions for being an equivalence
relation:


Reflexive For any $x \in \mathbb{N}$, $x = 2^0 x$ so $x \sim x$.

Symmetric If $x \sim y$, then from Lemma A.8.1 there exists $k \in \mathbb{Z}$ such
that $x = 2^k y$. But then $y = 2^{-k} x$, so applying the lemma again
gives $y \sim x$.

Transitive If $x \sim y \sim z$, then $x = 2^k y$ and $y = 2^\ell z$ for some $k, \ell \in \mathbb{Z}$
(Lemma A.8.1). Solve to get $x = 2^{k+\ell} z$, which gives $x \sim z$.


2. We have to be a little bit careful here. Most equivalence classes in $\mathbb{N}/\!\sim$
are infinite sets of the form $[2x + 1]_\sim = \{2^k (2x + 1) \mid k \in \mathbb{N}\}$, but $[0]_\sim$
is a special case.

Let

\[
f(n) = \begin{cases} [0]_\sim & \text{when } n = 0, \text{ and} \\ [2n - 1]_\sim & \text{when } n \ne 0. \end{cases}
\]

We claim that f is a bijection between $\mathbb{N}$ and $\mathbb{N}/\!\sim$.


First let's show that $[0]_\sim = \{0\}$. If $x \sim 0$, then $x = 2^k \cdot 0$ for some $k \in \mathbb{Z}$,
which gives $x = 0$.


To show f is injective, let $f(x) = f(y)$. We wish to show $x = y$. If
$x = y = 0$, we are done. If $x = 0$ and $y \ne 0$, then $2y - 1 \ne 0$ is not in
$[0]_\sim = \{0\}$, so $0 \not\sim 2y - 1$, contradicting $f(x) = f(y)$; the same holds if
$y = 0$ and $x \ne 0$. If $x \ne 0$ and $y \ne 0$, then $2x - 1 \sim 2y - 1$. Assume without
loss of generality that $2x - 1 = 2^k (2y - 1)$ for some $k \in \mathbb{N}$. Since the
left-hand side is odd, the right-hand side must be odd as well, so $k = 0$ and
$2x - 1 = 2y - 1$, which we can solve to get $x = y$.


To show f is surjective, consider some equivalence class $[m]_\sim$ in $\mathbb{N}/\!\sim$.
Then $[m]_\sim$ is a nonempty subset of the well-ordered set $\mathbb{N}$, so it has a
smallest element y.


• If y is even, then $y = 0$, because otherwise $y/2 \sim y$ is a smaller
element of $[m]_\sim$. In this case $[m]_\sim = [0]_\sim = f(0)$.

• If y is odd, then $y = 2x - 1$ for some $x \in \mathbb{N}$. In this case,
$[m]_\sim = [2x - 1]_\sim = f(x)$.


In either case we have found an $x \in \mathbb{N}$ such that $[m]_\sim = f(x)$, and f is
surjective.
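
The surjectivity argument picks the smallest element of each class, which is just the number with all factors of 2 stripped off. The sketch below makes that concrete: `related` follows the definition of $\sim$ directly, while `class_index` returns the n with $f(n) = [x]_\sim$. All three function names are our own, and the final check covers only a small range.

```python
def odd_part(x):
    """Smallest element of the class of x: divide out every factor of 2 (0 stays 0)."""
    while x != 0 and x % 2 == 0:
        x //= 2
    return x


def related(x, y):
    """x ~ y according to the definition: x = 2^k y or y = 2^k x for some natural k."""
    if x == 0 or y == 0:
        return x == y
    lo, hi = min(x, y), max(x, y)
    while lo < hi:
        lo *= 2
    return lo == hi


def class_index(x):
    """The n with f(n) = [x]: 0 for the class {0}, otherwise n with 2n - 1 = odd_part(x)."""
    c = odd_part(x)
    return 0 if c == 0 else (c + 1) // 2


# Two numbers are related exactly when they get the same index.
assert all(related(x, y) == (class_index(x) == class_index(y))
           for x in range(64) for y in range(64))
```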


A.8.3 Domination


Given functions $f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$, f is dominated by g if
$f(x) \le g(x)$ for all $x \in \mathbb{R}$.³ Write $f \preceq g$ if f is dominated by g.


1. Prove that $\preceq$ is a partial order.


2. Prove or disprove: $\preceq$ is a total order.


3. Prove or disprove: $\preceq$ is a lattice, in the sense that for any functions
$f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$, there exist functions $f \wedge g : \mathbb{R} \to \mathbb{R}$ and
$f \vee g : \mathbb{R} \to \mathbb{R}$ satisfying the definitions of meet and join for $\preceq$.


Solution 
1. This is just a matter of verifying the requisite properties of $\preceq$:


Reflexive For all $x \in \mathbb{R}$, $f(x) \le f(x)$, so $f \preceq f$.


Anti-symmetric Let $f \preceq g$ and $g \preceq f$. Then for all $x \in \mathbb{R}$, $f(x) \le
g(x) \le f(x)$ and thus $f(x) = g(x)$. Since this holds for all x,
$f = g$.


Transitive Let $f \preceq g \preceq h$. Then for all $x \in \mathbb{R}$, $f(x) \le g(x) \le h(x)$,
giving $f(x) \le h(x)$. So $f \preceq h$.


2. It's not a total order. Let $f(x) = x$ and $g(x) = -x$. Then $f(1) = 1 \not\le
-1 = g(1)$ and $g(-1) = 1 \not\le -1 = f(-1)$. So it is not the case that for
all x, $f(x) \le g(x)$, and it is not the case that for all x, $g(x) \le f(x)$:
these particular functions f and g are incomparable.


3. It is a lattice. Define $f \wedge g$ by the rule $(f \wedge g)(x) = \min(f(x), g(x))$
and $(f \vee g)(x) = \max(f(x), g(x))$.
We will show that $f \wedge g$ satisfies the definition of a meet. For all x,
$(f \wedge g)(x) = \min(f(x), g(x)) \le f(x)$ and similarly $(f \wedge g)(x) \le g(x)$,
so $f \wedge g \preceq f$ and $f \wedge g \preceq g$. If there is some function h such that
$h \preceq f$ and $h \preceq g$, then for all x, $h(x) \le f(x)$ and $h(x) \le g(x)$, so
$h(x) \le \min(f(x), g(x)) = (f \wedge g)(x)$. This shows $h \preceq f \wedge g$.
To show $f \vee g$ satisfies the definition of a join, apply duality to the
preceding argument.


³This definition works for functions $f : A \to B$ and $g : A \to B$ for any A and B as long
as B is a partially-ordered set, but for this problem we will stick with $A = B = \mathbb{R}$.


A.9 Assignment 9: Due Wednesday, 2017-11-15,
at 5:00 pm


A.9.1 Quadrangle closure


Call a graph $G = (V, E)$ quadrangle closed if, for any simple path $a_0 a_1 a_2 a_3$
in G, $a_3 a_0$ is an edge in G.


1. Define a quadrangle closure of G as a graph H with the property 
that G is a subgraph of H, H is quadrangle closed, and for any H’ 
such that G is a subgraph of H’ and H’ is quadrangle closed, H is a 
subgraph of H’. 


Show that every graph G has a unique quadrangle closure. 


2. Recall that a graph G is bipartite if it is possible to partition the
vertices of G into two disjoint sets S and T, such that every edge in E
has one endpoint in S and one in T.


Show that the quadrangle closure of a bipartite graph is bipartite. 


Solution 


1. We can do this part using the usual approach for closures: we'll consider
the set A of all quadrangle closed supergraphs of G, show that it is
nonempty, and then argue that the intersection of all graphs in A is
the quadrangle closure. (We define intersection in the obvious way,
where the intersection of a family of graphs $\{G_i = (V_i, E_i)\}$ is the graph
$(\bigcap V_i, \bigcap E_i)$.)

To show that A is nonempty, let $n = |V|$; then G is a subgraph of the
complete graph $K_n$, which is quadrangle closed because $a_3 a_0$ exists for
any pair of vertices $a_3$ and $a_0$.


To show that $H = \bigcap_{H' \in A} H'$ is quadrangle closed, consider any simple
path $a_0 a_1 a_2 a_3$ in H. Then this path also appears in every $H' \in A$, and
since A contains only quadrangle closed graphs, $a_3 a_0$ must also appear
in every $H' \in A$. But then $a_3 a_0$ appears in H.


Finally, we need to show that $H \subseteq H'$ for any quadrangle closed
supergraph $H'$ of G. But any such $H'$ is in A, so $H \subseteq H'$ because H is
the intersection of all graphs in A. This also shows that H is unique,
because if $H'$ is also a subgraph of every quadrangle closed supergraph
of G, it is a subgraph of H, so between $H \subseteq H'$ and $H' \subseteq H$ we have
$H = H'$.


2. For this part, let us consider an alternative construction of the
quadrangle closure of G.


Let $G_0 = G$, and for each $G_i$ that is not quadrangle closed, construct
$G_{i+1}$ by picking a path $a_{i0} a_{i1} a_{i2} a_{i3}$ for which $a_{i3} a_{i0}$ is missing, and
adding this missing edge to $G_i$.


If $G_i$ is bipartite, then so is $G_{i+1}$: if we assume without loss of generality
that $a_{i0} \in S$, then $a_{i1} \in T$, $a_{i2} \in S$, and $a_{i3} \in T$, so the new edge $a_{i3} a_{i0}$
goes from T to S. It follows by induction on i that $G_i$ is bipartite for
all i as long as $G_0 = G$ is.


We can similarly show by induction that each $G_i$ is a subgraph of every
quadrangle closed supergraph $H'$ of G. This holds trivially for $G_0 = G$.
Now suppose that it holds for $G_i$. Then the path $a_{i0} a_{i1} a_{i2} a_{i3}$ appears
in $H'$, and because $H'$ is quadrangle closed, $a_{i3} a_{i0}$ must also appear in
$H'$. But then $G_{i+1} = G_i \cup \{a_{i3} a_{i0}\}$ contains only edges in $H'$, and so is a
subgraph of $H'$.


Since we can add only finitely many edges to $G_0$, eventually we reach
a $G_i$ to which we will add no more edges. This occurs when $G_i$
contains $a_{i3} a_{i0}$ for every path $a_{i0} a_{i1} a_{i2} a_{i3}$, which means that this $G_i$ is
quadrangle closed. Since it is quadrangle closed and a subgraph of all
quadrangle closed supergraphs of G, it is the quadrangle closure of G.
We have already shown that every $G_i$ is bipartite if G is, so this shows
that the quadrangle closure of a bipartite G is bipartite.
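
The edge-by-edge construction used in this part is essentially an algorithm, and a small experiment shows it preserving bipartiteness. The sketch below is a brute-force version suitable only for tiny graphs; the function name `quadrangle_closure`, the representation of edges as two-element frozensets, and the six-vertex path used as input are our own choices.

```python
from itertools import permutations


def quadrangle_closure(vertices, edges):
    """Add edge a3-a0 for every simple path a0 a1 a2 a3 until no edge is missing."""
    closed = {frozenset(e) for e in edges}
    changed = True
    while changed:
        changed = False
        for a0, a1, a2, a3 in permutations(vertices, 4):
            on_path = (frozenset((a0, a1)) in closed
                       and frozenset((a1, a2)) in closed
                       and frozenset((a2, a3)) in closed)
            if on_path and frozenset((a3, a0)) not in closed:
                closed.add(frozenset((a3, a0)))
                changed = True
    return closed


# A path on six vertices is bipartite with sides {0, 2, 4} and {1, 3, 5}.
path_edges = [(i, i + 1) for i in range(5)]
closure = quadrangle_closure(range(6), path_edges)
print(sorted(tuple(sorted(e)) for e in closure))
# Every edge of the closure still joins an even vertex to an odd one.
assert all(sum(e) % 2 == 1 for e in closure)
```

For this input the closure turns out to be the complete bipartite graph between the even and odd vertices.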


A.9.2 Cycles 


Let G be a graph with at least three vertices, such that for any two distinct
vertices u and v in G, there are exactly two simple paths from u to v, and 
these paths have no edges in common. 

Show that G is a cycle. 


Solution 


First let’s find a cycle. 

Pick some vertex $v_0$. The degree of $v_0$ is at least 1, because otherwise
there are no paths from $v_0$ to any of the other vertices.

Let $v_1$ be a neighbor of $v_0$. Then there is a path from $v_1$ to $v_0$ consisting of
the single edge $v_1 v_0$. From the condition on G, there must be a second path
$P = v_1 v_2 \dots v_k v_0$ from $v_1$ to $v_0$. Because P is simple, all of these vertices are
distinct. So $C = v_0 v_1 v_2 \dots v_k v_0$ is a simple cycle on $k + 1$ vertices. We will now
show that $C = G$.

We do so in two steps. For the first step, we'll show that C contains
all the vertices in G. Suppose otherwise, and let w be a vertex not in C.
Then there is at least one path from w to $v_0$; let $w'$ be the last vertex in
this path not in C, and let $v_i$ be the vertex of C that follows $w'$ on this path. Then
$w' v_i v_{i+1} \dots v_k v_0$ and $w' v_i v_{i-1} \dots v_1 v_0$ are both paths from $w'$ to $v_0$, but they
violate the requirement of having no edges in common. So our assumption is
false, and there is no w not in C.

For the second step, we'll show that C contains all the edges in G. Suppose
that there is an edge $v_i v_j$ that is not in C. Without loss of generality, let
$i < j$. Then $v_i v_{i+1} \dots v_j$, $v_i v_j$, and $v_i v_{i-1} \dots v_0 v_k \dots v_{j+1} v_j$ are three different paths
from $v_i$ to $v_j$. This violates the requirement that there are exactly two $v_i$–$v_j$
paths, so our assumption is false.

It follows that $C \supseteq G$. By construction, $C \subseteq G$, so $C = G$, and G is a
cycle.


A.9.3. Deleting a graph 


Suppose you are given a graph $G_0 = (V_0, E_0)$, and wish to delete all of its
vertices. At each step, you may pick some vertex v of $G_i = (V_i, E_i)$ that has
degree at most 1 in $G_i$, and remove it, leaving $G_{i+1}$ as the induced subgraph
of $G_i$ on $V_i \setminus \{v\}$. If there is no such vertex v, then you are stuck.

Show that you can reduce a finite graph $G_0$ to the empty graph with no
vertices by this process if and only if $G_0$ is acyclic.


Solution 


The proof is by induction on the number of vertices n. 

If n = 0, we start with an empty graph, which is acyclic. This gives us 
the base case. 

For $n > 0$, there are two cases, depending on the structure of $G_0$:


1. There exists a vertex v of degree 1 or less. Because v has degree at
most 1, it can't be part of a cycle, so removing it gives a new graph
$G_1$ that contains a cycle if and only if $G_0$ does. (See Lemma 10.10.5.)


Remove this v to get a graph $G_1$ of $n - 1$ vertices. The induction
hypothesis says that $G_1$ can be reduced to the empty graph if and only
if $G_1$ (and thus $G_0$) is acyclic.


2. There are no vertices with degree 1 or less, and we are stuck. But
then every vertex has degree at least 2, so by the Handshaking Lemma
(Lemma 10.10.3) there are at least n edges. Corollary 10.10.7 says that
$G_0$ contains a cycle.
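
The deletion process doubles as a simple acyclicity test. The sketch below removes any available vertex of degree at most 1 until it either empties the graph or gets stuck; the function name `can_delete_all` and the adjacency-set representation are our own, and the code assumes a simple graph given as a vertex list and a list of edge pairs.

```python
def can_delete_all(vertices, edges):
    """Repeatedly remove a vertex of degree at most 1; report whether nothing is left."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    while neighbors:
        v = next((u for u, adj in neighbors.items() if len(adj) <= 1), None)
        if v is None:
            return False  # stuck: every remaining vertex has degree >= 2
        for u in neighbors.pop(v):
            neighbors[u].discard(v)
    return True


print(can_delete_all(range(4), [(0, 1), (1, 2), (1, 3)]))  # a tree
print(can_delete_all(range(3), [(0, 1), (1, 2), (2, 0)]))  # a triangle
```

By the result just proved, which removable vertex we pick at each step does not affect the answer, so taking the first one found is enough.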


A.10 Assignment 10: Due Wednesday, 2017-11-29, 
at 5:00 pm 


As always, justify your answers. 


A.10.1 Too many injections 


Let A, B, and C be k-element sets, and let S be an n-element set, where 
k<n. 

How many different triples of functions $f : A \to S$, $g : B \to S$, and
$h : C \to S$ are there such that f, g, and h are all injective, and $f(A) =
g(B) = h(C)$?⁴


Solution 


First let's pick the set $T = f(A) = g(B) = h(C)$. There are $\binom{n}{k}$ different
ways to do this. Because each of f, g, and h is injective, and is surjective
on T, each function is a bijection between A, B, or C and T. Having fixed
T, there are $k!$ choices of bijection for each set. Multiplying everything out
gives

\[
\binom{n}{k} (k!)^3 = (n)_k \, (k!)^2
\]

choices.

(I would write this in the form on the left, since it more closely reflects
the preceding argument, but the right-hand version works as well.)
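
For small k and n the count can be confirmed by enumeration. The sketch below treats an injection from a k-element set as an ordered choice of k distinct values, groups the injections by image, and cubes each group size; the function name `count_triples` and the tested parameter pairs are our own.

```python
from collections import Counter
from itertools import permutations
from math import comb, factorial


def count_triples(k, n):
    """Count triples of injections from k-element sets into range(n) with a common image."""
    # An injection from a k-element set is an ordered choice of k distinct values.
    images = Counter(frozenset(p) for p in permutations(range(n), k))
    # For each possible image T, the functions f, g, and h are chosen independently.
    return sum(c ** 3 for c in images.values())


for k, n in [(0, 3), (1, 4), (2, 4), (3, 5)]:
    assert count_triples(k, n) == comb(n, k) * factorial(k) ** 3
```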


A.10.2 Binomial coefficients


Show that, for any $k, m, n \in \mathbb{N}$ such that $0 \le k \le m \le n$,

\[
\binom{n}{m} \binom{m}{k} = \binom{n}{k} \binom{n-k}{m-k}.
\]


⁴This last bit means that f, g, and h all have the same range. The convention is that
when A is a subset of the domain of f (or even the entire domain), $f(A) = \{f(x) \mid x \in A\}$.


Solution


Let S be an n-element set. Then S has $\binom{n}{m}$ m-element subsets T, and each
such T has $\binom{m}{k}$ k-element subsets U. This gives a total of $\binom{n}{m} \binom{m}{k}$ pairs
$(T, U)$ where $|T| = m$, $|U| = k$, and $U \subseteq T \subseteq S$.

We now give an alternative way to construct T and U. First pick $U \subseteq S$
with $|U| = k$: there are $\binom{n}{k}$ ways to do this. Now we'll pick $T \setminus U$, which will
be an $(m-k)$-element subset of the $(n-k)$-element set $S \setminus U$; there are $\binom{n-k}{m-k}$
ways to do this. So we get $\binom{n}{k} \binom{n-k}{m-k}$ pairs $(U, T \setminus U)$, and there is a bijection
mapping these to pairs $(T', U')$ given by $U' = U$ and $T' = U \cup (T \setminus U)$.

Since we have a bijection between a set of size $\binom{n}{m} \binom{m}{k}$ and a set of size
$\binom{n}{k} \binom{n-k}{m-k}$, these quantities must be equal.


A.10.3. Variable names 


A certain poorly-designed programming language limits variable names to 
consist of zero or more letters from the set {a,b,c} followed by zero or more 
digits from the set {0,1,2,3}. So the empty string, c, 12, cab100, and 
abba2012 are all legal variable names, but 1c and actla are not. 

There is exactly 1 legal variable name of length 0: the empty string. 
There are exactly 7 legal variable names of length 1: a, b, c, 0, 1, 2, and 3. 
There are 37 legal variable names of length 2: aa, ab, ac, a0, a1, a2, a3, ba,
bb, bc, b0, b1, b2, b3, ca, cb, cc, c0, c1, c2, c3, 00, 01, 02, 03, 10, 11, 12,
13, 20, 21, 22, 23, 30, 31, 32, and 33.

Give a closed-form expression for the number of legal variable names of 
length n. 


Solution 
There are many, many ways to do this. Here are some of them. 
Disjoint union plus power series formula: A variable name of length
n will consist of k letters followed by $n - k$ digits, for some k. For each fixed
k, this gives $3^k 4^{n-k}$ possibilities. These cases are all disjoint, so summing
over all possible k gives

\begin{align*}
\sum_{k=0}^{n} 3^k 4^{n-k} &= 4^n \sum_{k=0}^{n} (3/4)^k \\
  &= 4^n \cdot \frac{1 - (3/4)^{n+1}}{1 - (3/4)} \\
  &= 4^n \cdot 4 \cdot \left( 1 - (3/4)^{n+1} \right) \\
  &= 4^{n+1} - 3^{n+1}.
\end{align*}


Generating function: Construct a generating function for the series
$F(z) = \sum_{n=0}^{\infty} a_n z^n$, where $a_n$ is the number of legal variable names of length n.
Each variable name consists of zero or more letters from a 3-character
alphabet, then zero or more digits from a 4-character collection of digits.

The generating function for the letters is $\sum_{n=0}^{\infty} 3^n z^n = \frac{1}{1 - 3z}$.

The generating function for the digits is $\sum_{n=0}^{\infty} 4^n z^n = \frac{1}{1 - 4z}$.

Multiplying these together gives a generating function

\begin{align*}
F(z) &= \frac{1}{(1 - 3z)(1 - 4z)} \\
     &= \frac{A}{1 - 3z} + \frac{B}{1 - 4z} \\
     &= \frac{A(1 - 4z) + B(1 - 3z)}{(1 - 3z)(1 - 4z)},
\end{align*}

for coefficients A and B to be determined.
Matching coefficients in the numerator gives

\begin{align*}
1 &= A + B, \\
0 &= 4A + 3B,
\end{align*}

which has the convenient solution

\begin{align*}
A &= -3, \\
B &= 4.
\end{align*}

This gives

\begin{align*}
F(z) &= \frac{-3}{1 - 3z} + \frac{4}{1 - 4z} \\
     &= -3 \sum_{n=0}^{\infty} 3^n z^n + 4 \sum_{n=0}^{\infty} 4^n z^n,
\end{align*}

from which we can read off the coefficients

\[
a_n = 4 \cdot 4^n - 3 \cdot 3^n = 4^{n+1} - 3^{n+1}.
\]



Combinatorial proof: This is tricky to do unless you already know what 
the answer is. 

The expression suggests encoding each variable name of length n as a
string of characters of length $n + 1$ from an alphabet of size 4 (the $4^{n+1}$),
where any string consisting only of characters from some sub-alphabet of
size 3 is forbidden (the $-3^{n+1}$).

Here is one such encoding: Consider a string of $n + 1$ characters $x_0 \dots x_n$
from $\{a, b, c, d\}$.

Define a new string $y_0 \dots y_{n-1}$ by

\[
y_i = \begin{cases} x_i & \text{if } x_j \ne d \text{ for all } j \le i, \\ f(x_{i+1}) & \text{otherwise,} \end{cases}
\]

where f maps each letter a, b, c, d to the corresponding digit 0, 1, 2, 3.

Given one of the $4^{n+1}$ strings x of length $n + 1$, one of two things happens:
either x contains a d, in which case it is mapped to a string of length n
satisfying the rules for variable names, or it does not, in which case it is
mapped to a string of length $n + 1$ which may still satisfy the rules (it's all
letters a, b, or c), but is too long. There are $3^{n+1}$ such exceptions, and the
non-exceptional inputs map bijectively to the legal variable names of length
n. So our original set of x's is a disjoint union of $4^{n+1} - 3^{n+1}$ strings that
correspond to legal variable names of length n and $3^{n+1}$ strings that don't.
This gives $4^{n+1} - 3^{n+1}$ legal variable names of length n.


Induction proof: Here we want to somehow reduce the number of variable 
names of length n to a smaller case or cases. This is a little bit tricky, because 
after choosing the first character of a variable name we may be constrained 
in what we can put in the rest. 

To make this work, we can split the set of variable names $S_n$ of length n
into two disjoint subsets: the set $A_n$ of length-n variable names that start
with a letter, and the set $B_n$ of length-n variable names that start with a
digit. Because every character in a name in $B_n$ must be a digit, we can
calculate $|B_n| = 4^n$. For an element of $A_n$, there are 3 choices for the first
letter; after this we have $|S_{n-1}|$ choices for the remaining characters, because
any legal length-$(n-1)$ variable name can follow a letter.

This tells us that $|S_n| = 3|S_{n-1}| + 4^n$, which is an example of a recurrence.
Unfortunately we did not spend much time on solving recurrences this
semester. But if we miraculously guess the correct solution $|S_n| = 4^{n+1} - 3^{n+1}$,
we can verify that it works using induction, by showing $|S_0| = 4 - 3 = 1$
(base case) and $|S_n| = 3|S_{n-1}| + 4^n = 3(4^n - 3^n) + 4^n = (3 + 1) \cdot 4^n - 3 \cdot 3^n =
4^{n+1} - 3^{n+1}$ (induction step).
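
All four arguments arrive at the same formula, and for small n we can also just count. The sketch below enumerates every string over the seven allowed characters and keeps those matching "letters then digits"; the regular expression and the function name `count_legal` are our own rendering of the rule in the problem statement.

```python
import re
from itertools import product

LEGAL = re.compile(r"[abc]*[0123]*")


def count_legal(n):
    """Count length-n strings over {a,b,c,0,1,2,3} that are letters followed by digits."""
    return sum(1 for chars in product("abc0123", repeat=n)
               if LEGAL.fullmatch("".join(chars)))


print([count_legal(n) for n in range(5)])  # [1, 7, 37, 175, 781]
assert all(count_legal(n) == 4 ** (n + 1) - 3 ** (n + 1) for n in range(7))
```

The first three values agree with the counts 1, 7, and 37 given in the problem.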


Appendix B 


Sample exams from Fall 2017 


B.1 CPSC 202 Exam 1, October 17th, 2017 


Write your answers on the exam. Justify your answers. Work alone. Do not 
use any notes or books. 

There are four problems on this exam, each worth 20 points, for a total 
of 80 points. You have approximately 75 minutes to complete this exam. 


B.1.1 Factorials (20 points)

Prove that $2^n$ divides $(2n)!$ for all $n \in \mathbb{N}$.


Solution


By induction on n. When $n = 0$, we have $(2 \cdot 0)! = 2^0 = 1$.

For larger n, $(2n)! = \prod_{k=1}^{2n} k = 2n \cdot (2n - 1) \cdot \prod_{k=1}^{2n-2} k = 2n \cdot (2n - 1) \cdot (2(n - 1))!$.
From the induction hypothesis, $2^{n-1}$ divides $(2(n - 1))!$, so there exists an
$m \in \mathbb{N}$ such that $m \cdot 2^{n-1} = (2(n - 1))!$. But then $(2n)! = 2n \cdot (2n - 1) \cdot m \cdot 2^{n-1} =
n \cdot (2n - 1) \cdot m \cdot 2^n$, and $2^n$ divides $(2n)!$.


B.1.2 A tautology (20 points) 
Using a truth table, show that

\[
(P \wedge Q) \to (P \to Q) \tag{B.1.1}
\]

is true for all values of P and Q.


Solution

    P   Q   P ∧ Q   P → Q   (P ∧ Q) → (P → Q)
    0   0     0       1             1
    0   1     0       1             1
    1   0     0       0             1
    1   1     1       1             1


B.1.3 Subsets (20 points) 


(For this problem, assume that A and B are sets.) 
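Prove or disprove:

\[
\forall A : \bigl( (\forall B : B = A \vee B \not\subseteq A) \to A = \emptyset \bigr).
\]


Solution


Proof: Fix A. Suppose that $B = A \vee B \not\subseteq A$ holds for all B. Let $B = \emptyset$.
Then $B \subseteq A$, so for $B = A \vee B \not\subseteq A$ to hold it must be the case that $B = A$.
But then $A = B = \emptyset$.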


B.1.4 Surjective functions (20 points) 


Prove or disprove: For all functions $g : A \to B$ and $f : B \to C$, if f is
surjective, and g is surjective, then $f \circ g$ is surjective.


Solution


Proof: Let f and g be surjective. Let c be some element of C. Because
f is surjective, there exists some $b \in B$ such that $f(b) = c$. Because g is
surjective, there exists some $a \in A$ such that $g(a) = b$. But then $f(g(a)) = c$.
Since our choice of c was arbitrary, this means that for any $c \in C$ there is an
$a \in A$ such that $(f \circ g)(a) = f(g(a)) = c$. So $f \circ g$ is surjective.


B.2 CPSC 202 Exam 2, December 7th, 2017 


Write your answers on the exam. Justify your answers. Work alone. Do not 
use any notes or books. 

There are four problems on this exam, each worth 20 points, for a total 
of 80 points. You have approximately 75 minutes to complete this exam. 
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B.2.1 Non-decreasing sequences (20 points) 


Recall that a sequence a1, a2,...,@p is non-decreasing if a; < a; when 7 < 7. 

Suppose we generate a sequence @1,42,...,@ of n values from {0, 1,2} 
uniformly at random, so that all such sequences are equally likely. What is 
the probability that a is non-decreasing? 


Solution 


A non-decreasing sequence can be described by a partition n = no +n, + no, 
where n,; is the number of values 7 that appear in the sequence. There are 
n+1 choices for no (anywhere from 0 to n), and given no there are n— no +1 
choices for n1. So the total number of possibilities is 


n 


So (n= 29 +1) = (n+ intl) — Yo no 


no=0 no=0 
n(in+1 
= (n+ Angi) Med 
_ n2+3n+2 
2 


— (n+1)(n +2) 


2 
— (nt2 
= ae if 
This quantity can also be derived by a combinatorial argument, since 
we can generate a partition n = no + ny + ng by lining up n+ 2 objects, 
removing two of them, and letting no, n1, and nz be the size of the regions 
separated by the resulting gaps. Since we are picking two distinct objects 
and don’t care about their order, there are (33) ways to do this. 
Alternatively, we can use generating functions. The generating function
for each sequence of identical digits is 1/(1 − z), so the generating function for
three consecutive sequences is 1/(1 − z)^3 = (1 − z)^{−3}. Expanding this out using
the binomial theorem gives:

\[
\begin{aligned}
(1-z)^{-3} &= \sum_{n=0}^{\infty} \binom{-3}{n} (-z)^n \\
&= \sum_{n=0}^{\infty} (-1)^n \binom{-3}{n} z^n \\
&= \sum_{n=0}^{\infty} \frac{(n+2)(n+1)}{2} z^n \\
&= \sum_{n=0}^{\infty} \binom{n+2}{2} z^n.
\end{aligned}
\]

In each case, the probability that the sequence is non-decreasing is
$\binom{n+2}{2} / 3^n$.
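
The count of non-decreasing sequences can be checked by brute force for small n. The following is a minimal sketch, not part of the exam solution; the function name count_non_decreasing is an illustrative choice, and the check only covers the small values of n that it enumerates.

```python
from itertools import product
from math import comb


def count_non_decreasing(n):
    """Count length-n sequences over {0, 1, 2} that are non-decreasing."""
    return sum(
        1
        for seq in product(range(3), repeat=n)
        # a sequence is non-decreasing iff every adjacent pair is in order
        if all(seq[i] <= seq[i + 1] for i in range(n - 1))
    )


# Compare the enumeration with the closed form binom(n+2, 2) for small n.
for n in range(7):
    assert count_non_decreasing(n) == comb(n + 2, 2)
```

Dividing the count by 3**n (for instance with fractions.Fraction) gives the probability asked for in the problem.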


B.2.2 Perfect matchings (20 points) 


A matching on a graph G is a subgraph M of G where every vertex in M 
has degree exactly 1. A perfect matching on G is a matching that includes 
every vertex in G. 

As a function of n, how many perfect matchings are there on K_{2n}, the
complete graph with 2n vertices? For full credit, put your answer in closed
form.


Solution 


The quickest way to do this is to observe that we can generate a perfect
matching by choosing a numbering of the vertices from 0 to 2n − 1 (there
are (2n)! ways to do this), and then matching each even-numbered vertex 2k
with the following odd-numbered vertex 2k + 1. Because we could flip each
pair of matched vertices and get the same matching, as well as reordering the
pairs, this counts each matching 2^n n! times. So the number of matchings is
(2n)!/(2^n n!).

Alternatively, we could generate the matching one step at a time. Suppose
we have already numbered the vertices of G from 0 to 2n − 1. Consider the
following method for generating a perfect matching. At each step, pair the
smallest-numbered unmatched vertex with some other unmatched vertex.
After adding k edges, there are 2(n − k) − 1 ways to do this. Multiplying out
all these values gives ∏_{k=0}^{n−1} (2(n − k) − 1) = ∏_{k=1}^{n} (2k − 1) possible perfect
matchings.




If we happen to remember the double factorial notation, we can write 
this as (2n — 1)!!. Otherwise, we have to do some work to construct this 
product using ordinary factorial. 

One way to do this is to take the product of all the numbers from 1 to 
2n and remove the even numbers. Observe that 


\[
\begin{aligned}
(2n)! &= \prod_{k=1}^{n} (2k-1) \cdot \prod_{k=1}^{n} 2k \\
&= \prod_{k=1}^{n} (2k-1) \cdot 2^n \prod_{k=1}^{n} k \\
&= \prod_{k=1}^{n} (2k-1) \cdot 2^n n!.
\end{aligned}
\]

Dividing out 2^n n! gives

\[
\prod_{k=1}^{n} (2k-1) = \frac{(2n)!}{2^n n!}.
\]


This happens to agree with our previous solution, which is always reas- 
suring. 

We can also start with (2n − 1)! and remove the even numbers from 2 to
2n − 2. This gives the equivalent, though slightly less compact, closed-form
expression (2n − 1)!/(2^{n−1} (n − 1)!).
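
The step-at-a-time construction translates directly into a short recursive count, which can be compared against the closed form. This is a sketch for small n only; the names count_perfect_matchings and closed_form are illustrative and not from the notes.

```python
from math import factorial


def count_perfect_matchings(vertices):
    """Count perfect matchings of the complete graph on the given vertex list."""
    if not vertices:
        return 1  # the empty graph has exactly one (empty) perfect matching
    rest = vertices[1:]
    # Pair the smallest-numbered unmatched vertex with each other unmatched
    # vertex in turn, then count matchings of whatever is left.
    return sum(
        count_perfect_matchings(rest[:i] + rest[i + 1:])
        for i in range(len(rest))
    )


def closed_form(n):
    """(2n)! / (2^n n!), the closed form derived above."""
    return factorial(2 * n) // (2 ** n * factorial(n))


for n in range(6):
    assert count_perfect_matchings(list(range(2 * n))) == closed_form(n)
```

For n = 3 both sides give 15 = 5 · 3 · 1, matching the double-factorial form (2n − 1)!!.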


B.2.3. Quadratic forms (20 points) 


A quadratic form is a function f : R^n → R of the form

\[
f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} x_i x_j,
\]

where the c_{ij} are constants.
Show that for any quadratic form f, there is a matrix A such that
f(x) = x^⊤ A x, where x is represented as a column vector.


Solution


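Use the definition of matrix multiplication to expand

\[
x^{\top} A x = \sum_{i=1}^{n} x_i (Ax)_i = \sum_{i=1}^{n} x_i \sum_{j=1}^{n} A_{ij} x_j = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j.
\]

Setting A_{ij} to c_{ij} for all i and j makes this equal to f(x).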


B.2.4 Minimal lattices (20 points) 


Prove or disprove: For any partial order (S, ≤) that is a lattice, if x is a
minimal element of S, x is also a minimum element of S.


Solution 


Here are two proofs. 

Direct proof: Let x be a minimal element of S. Then for any y ≤ x,
y = x. To show x is a minimum, we need to show that x ≤ z for all z in
S. Pick some such z, and consider the element x ∧ z. Then x ∧ z ≤ x, so
x ∧ z = x. But then x = x ∧ z ≤ z.

By contraposition: Suppose x is not a minimum element. Then there
exists y such that x ≰ y. Let z = x ∧ y. Then z ≤ x and z ≤ y. Because
z ≤ y, z ≠ x. So there exists a z such that z ≤ x and z ≠ x, meaning that x
is not minimal.


Appendix C 


Sample assignments from 
Fall 2013 


These are sample assignments from the Fall 2013 version of CPSC 202. 


C.1 Assignment 1: due Thursday, 2013-09-12, at 
5:00 pm 
Bureaucratic part 


Send me email! My address is james.aspnes@gmail.com. 
In your message, include: 


1. Your name. 


2. Your status: whether you are an undergraduate, grad student, auditor, 
etc. 


3. Anything else you’d like to say. 


(You will not be graded on the bureaucratic part, but you should do it 
anyway.) 
C.1.1 Tautologies 


Show that each of the following propositions is a tautology using a truth 
table, following the examples in §2.2.2. Each of your truth tables should 
include columns for all sub-expressions of the proposition. 

1. (¬P → P) ↔ P.


2. P ∨ (Q → ¬(P ↔ Q)).


3. (P ∨ Q) ↔ (Q ∨ (P ↔ (Q → R))).


Solution 


For each solution, we give the required truth-table solution first, and then 
attempt to give some intuition for why it works. The intuition is merely an 
explanation of what is going on and is not required for your solutions. 


1. Here is the truth table: 


   P   ¬P   ¬P → P   (¬P → P) ↔ P
   0   1      0            1
   1   0      1            1


Intuitively, the only way for an implication to be false is if it has a true
premise and a false conclusion, so to make ¬P → P false we need P
to be false, which is what the tautology says.


2. Here we just evaluate the expression completely and see that it is 
always true: 


   P   Q   P ↔ Q   ¬(P ↔ Q)   Q → ¬(P ↔ Q)   P ∨ (Q → ¬(P ↔ Q))
   0   0     1        0             1                  1
   0   1     0        1             1                  1
   1   0     0        1             1                  1
   1   1     1        0             0                  1


This is a little less intuitive than the first case. A reasonable story
might be that the proposition is true if P is true, so for it to be false,
P must be false. But then ¬(P ↔ Q) reduces to Q, and Q → Q is
true.


3. (P ∨ Q) ↔ (Q ∨ (P ↔ (Q → R))).


   P   Q   R   P ∨ Q   Q → R   P ↔ (Q → R)   Q ∨ (P ↔ (Q → R))   (P ∨ Q) ↔ (Q ∨ (P ↔ (Q → R)))
   0   0   0     0       1          0                0                         1
   0   0   1     0       1          0                0                         1
   0   1   0     1       0          1                1                         1
   0   1   1     1       1          0                1                         1
   1   0   0     1       1          1                1                         1
   1   0   1     1       1          1                1                         1
   1   1   0     1       0          0                1                         1
   1   1   1     1       1          1                1                         1


I have no intuition whatsoever for why this is true. In fact, all three
of these tautologies were plucked from long lists of machine-generated
tautologies, and three variables is enough to start getting tautologies
that don’t have good stories.


It’s possible that one could prove this more succinctly by arguing by
cases that if Q is true, both sides of the biconditional are true, and if
Q is not true, then Q → R is always true so P ↔ (Q → R) becomes
just P, making both sides equal. But sometimes it is more direct (and
possibly less error-prone) just to “shut up and calculate.”
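
In the same "shut up and calculate" spirit, the three truth tables can be checked mechanically. The sketch below is not part of the assignment (which asks for hand-written tables); the helper names implies, iff, and is_tautology are illustrative.

```python
from itertools import product


def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q


def iff(p, q):
    """Biconditional p <-> q."""
    return p == q


def is_tautology(formula, num_vars):
    """True if formula evaluates to True on every row of its truth table."""
    return all(
        formula(*values)
        for values in product([False, True], repeat=num_vars)
    )


# The three propositions from the problem statement.
t1 = lambda p: iff(implies(not p, p), p)
t2 = lambda p, q: p or implies(q, not iff(p, q))
t3 = lambda p, q, r: iff(p or q, q or iff(p, implies(q, r)))

assert is_tautology(t1, 1)
assert is_tautology(t2, 2)
assert is_tautology(t3, 3)
```

A plain implication such as P → Q fails the same check, which confirms that the checker is not vacuous.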


C.1.2 Positively equivalent 


Show how each of the following propositions can be simplified using equiva- 
lences from Table 2.2 to a single operation applied directly to P and Q. 


1. ¬(P → ¬Q).
2. ¬((P ∧ ¬Q) ∨ (¬P ∧ Q)).


Solution 
1.

    ¬(P → ¬Q) ≡ ¬(¬P ∨ ¬Q)
              ≡ ¬¬P ∧ ¬¬Q
              ≡ P ∧ Q.

2.

    ¬((P ∧ ¬Q) ∨ (¬P ∧ Q)) ≡ ¬(P ∧ ¬Q) ∧ ¬(¬P ∧ Q)
                            ≡ (¬P ∨ ¬¬Q) ∧ (¬¬P ∨ ¬Q)
                            ≡ (¬P ∨ Q) ∧ (P ∨ ¬Q)
                            ≡ (¬P ∨ Q) ∧ (¬Q ∨ P)
                            ≡ (P → Q) ∧ (Q → P)
                            ≡ P ↔ Q.


C.1.3 A theory of leadership


Suppose we are trying to write down, in predicate logic, a theory explaining
the success or failure of various historical leaders. The universe consists of
historical leaders; when writing ∀x we are implicitly limiting ourselves to
historical leaders under consideration, and similarly when writing ∃x. We
have two predicates taller(x, y) (“x was taller than y”) and successful(x) (“x
was successful as a leader”), as well as all the usual tools of predicate logic ∀,
∃, =, and so forth, and can refer to specific leaders by name.

Express each of the following statements in mathematical form. Note 
that these statements are not connected, and no guarantees are made about 
whether any of them are actually true. 


1. Lincoln was the tallest leader. 
2. Napoleon was at least as tall as any unsuccessful leader. 


3. No two leaders had the same height. 


Solution 


1. The easiest way to write this is probably ∀x : taller(Lincoln, x). There
is a possible issue here, since this version says that nobody is taller than
Lincoln, but it may be that somebody is the same height.¹ A stronger
claim is ∀x : (x ≠ Lincoln) → taller(Lincoln, x). Both solutions (and
their various logical equivalents) are acceptable.


2. ∀x : ¬successful(x) → ¬taller(x, Napoleon).


3. ∀x ∀y : (x = y) ∨ taller(x, y) ∨ taller(y, x). Equivalently, ∀x ∀y : x ≠ y →
(taller(x, y) ∨ taller(y, x)). If we assume that taller(x, y) and taller(y, x)
are mutually exclusive, then ∀x ∀y : (x = y) ∨ (taller(x, y) ⊕ taller(y, x))
also works.


¹At least one respected English-language novelist [Say33] has had a character claim
that it is well understood that stating that a particular brand of toothpaste is “the most
effective” is not a falsehood even if it is equally effective with other brands (which are
also “the most effective”), but this understanding is not universal. The use of “the” also
suggests that Lincoln is unique among tallest leaders.


C.2 Assignment 2: due Thursday, 2013-09-19, at
5:00 pm 


C.2.1 Subsets 


Let R, S, and T be sets. Using the definitions of ⊆, ∪, ∩, and \ given in
§3.2, prove or disprove each of the following statements:


1. R is a subset of S if and only if R ⊆ (S \ T) ∪ (R ∩ T).
2. R is a subset of S if and only if R ⊆ R ∩ S.


3. R is a subset of S \ R if and only if R = ∅.


Solution 


1. Disproof: Consider R = T = {1}, S = ∅. Then R is not a subset of S,
but

    (S \ T) ∪ (R ∩ T) = (∅ \ {1}) ∪ ({1} ∩ {1})
                      = ∅ ∪ {1}
                      = {1}
                      ⊇ R.


2. Proof: We need to show this in both directions: first, that if R is a
subset of S, then R ⊆ R ∩ S; then, that if R ⊆ R ∩ S, R is a subset of
S.


Suppose that R is a subset of S. Let x be an element of R. Then x is
also an element of S. Since it is an element of both R and S, x is an
element of R ∩ S. It follows that R ⊆ R ∩ S.


Conversely, suppose that R ⊆ R ∩ S, and let x be an element of R.
Then x is an element of R ∩ S, implying that it is an element of S.
Since x was arbitrary, this gives that every element of R is an element
of S, or R ⊆ S.


3. Proof: If R = ∅, then R is a subset of any set, including S \ R.
Alternatively, if R ≠ ∅, then R has at least one element x. But S \ R
contains only those elements y that are in S but not R; since x is in R,
it isn’t in S \ R, and R ⊈ S \ R.


C.2.2 A distributive law

Show that the following identities hold for all sets A, B, and C:
1. A × (B ∪ C) = (A × B) ∪ (A × C).
2. A × (B ∩ C) = (A × B) ∩ (A × C).


Solution 


1. Let (a, x) ∈ A × (B ∪ C). Then a ∈ A and x ∈ B ∪ C. If x ∈ B, then
(a, x) ∈ A × B; alternatively, if x ∈ C, then (a, x) ∈ A × C. In either
case, (a, x) ∈ (A × B) ∪ (A × C).
Conversely, if (a, x) ∈ (A × B) ∪ (A × C), then either (a, x) ∈ A × B
or (a, x) ∈ A × C. In either case, a ∈ A. In the first case, x ∈ B,
and in the second x ∈ C, giving x ∈ B ∪ C in either case as well. So
(a, x) ∈ A × (B ∪ C).


2. Let (a, x) ∈ A × (B ∩ C). Then a ∈ A and x ∈ B ∩ C, giving x ∈ B and
x ∈ C. From a ∈ A and x ∈ B we have (a, x) ∈ A × B; similarly from
a ∈ A and x ∈ C we have (a, x) ∈ A × C. So (a, x) ∈ (A × B) ∩ (A × C).


Conversely, if (a, x) ∈ (A × B) ∩ (A × C), then (a, x) ∈ A × B and
(a, x) ∈ A × C. Both give a ∈ A, and the first gives x ∈ B while the
second gives x ∈ C. From this we can conclude that x ∈ B ∩ C, so
(a, x) ∈ A × (B ∩ C).

C.2.3. Exponents 


Let A be a set with |A| = n > 0. What is the size of each of the following 
sets of functions? Justify your answers. 


1. A^∅.
2. ∅^A.


3. ∅^∅.


Solution 


1. |A^∅| = 1. Proof: There is exactly one function from ∅ to A (the empty
function).

2. |∅^A| = 0. Proof: There is no function from A to ∅, because A contains
at least one element x, and any function f : A → ∅ would have to map
x to some f(x) ∈ ∅, a contradiction.


3. |∅^∅| = 1. Proof: This is just a special case of A^∅; the empty function is
the only function with ∅ as a domain. Note that this doesn’t contradict
the ∅^A result, because there is no x ∈ ∅ that we fail to send anywhere.
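
These three counts can be reproduced by enumerating functions explicitly, treating a function as a choice of one codomain element for each domain element. This is a small illustrative sketch; count_functions is a hypothetical helper, not something defined in the notes.

```python
from itertools import product


def count_functions(domain, codomain):
    """Number of functions from domain to codomain, by explicit enumeration."""
    # Each tuple produced assigns one codomain element to each domain element.
    # With an empty domain there is exactly one such tuple: the empty one.
    return sum(1 for _ in product(codomain, repeat=len(domain)))


A = ["x", "y", "z"]          # any set with |A| = n > 0
assert count_functions([], A) == 1    # |A^∅| = 1
assert count_functions(A, []) == 0    # |∅^A| = 0
assert count_functions([], []) == 1   # |∅^∅| = 1
```

The same helper gives |B|^|A| in general, for example 2^3 = 8 functions from A to a two-element set.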


C.3 Assignment 3: due Thursday, 2013-09-26, at 
5:00 pm 


C.3.1 Surjections 


Let f : S → T be surjective. Let S′ ⊆ S, and let T′ = f(S′) = {f(x) | x ∈ S′}.
Prove or disprove: For any f, S, T, S′ as above, there exists a surjection
g : S \ S′ → T \ T′.


Solution 


Disproof: Suppose S ≠ S′ but T = T′; this can occur, for example, if
S = {a, b}, T = {z}, f(a) = f(b) = z, and S′ = {a}. In this case,
T′ = T = {z}, giving T \ T′ = ∅. But S \ S′ = {b} ≠ ∅, and since there
are no functions from a nonempty set to the empty set, there can’t be a
surjection g : S \ S′ → T \ T′.


The solution that got away 


I will confess that when I wrote this problem, I forgot about the empty set
case.²

Here is a proof that g : S \ S′ → T \ T′ exists when T′ ≠ T, which, alas,
is not the actual claim:

Let g(x) = f(x) for each x in S \ S′; in other words, g is the restriction
of f to S \ S′.

Let y ∈ T \ T′. Because f is surjective, there exists x in S with f(x) = y.
Fix some such x. We have f(x) = y ∉ T′, so x ∉ S′. It follows that x is in
S \ S′.

We’ve just shown that for any y ∈ T \ T′, there is some x ∈ S \ S′ such
that f(x) = y. But then g(x) = f(x) = y as well, so g is surjective.


²I am grateful to Josh Rosenfeld for noticing this issue.


C.3.2 Proving an axiom the hard way


Recall that, if a, b, and c are all in N, and a ≤ b, then a + c ≤ b + c
(Axiom 4.2.4).

For any two sets S and T, define S ⪯ T if there exists an injection
f : S → T. We can think of ⪯ as analogous to ≤ for sets, because |S| ≤ |T|
if and only if S ⪯ T.

Show that if A ∩ C = B ∩ C = ∅, and A ⪯ B, then

    A ∪ C ⪯ B ∪ C.


Clarification added 2013-09-25 It’s probably best not to try using the
statement |S| ≤ |T| if and only if S ⪯ T in your proof. While this is one
way to define ≤ for arbitrary cardinals, the odds are that your next step is
to assert |A| + |C| ≤ |B| + |C|, and while we know that this works when A,
B, and C are all finite (Axiom 4.2.4), that it works for arbitrary sets is what
we are asking you to prove.


Solution 


We’ll construct an explicit injection g : A ∪ C → B ∪ C. For each x in A ∪ C,
let

\[
g(x) = \begin{cases} f(x) & \text{if } x \in A, \text{ and} \\ x & \text{if } x \in C. \end{cases}
\]

Observe that g is well-defined because every x in A ∪ C is in A or C but
not both. Observe also that, since B ∩ C = ∅, g(x) ∈ B if and only if x ∈ A,
and similarly g(x) ∈ C if and only if x ∈ C.

We now show that g is injective. Let g(x) = g(y). If g(x) = g(y) ∈ B,
then x and y are both in A. In this case, f(x) = g(x) = g(y) = f(y)
and x = y because f is injective. Alternatively, if g(x) = g(y) ∈ C, then
x = g(x) = g(y) = y. In either case we have x = y, so g is injective.


C.3.3. Squares and bigger squares 


Show directly from the axioms in Chapter 4 that 0 ≤ a ≤ b implies a·a ≤ b·b,
when a and b are real numbers.


Solution


Apply scaling invariance (Axiom 4.2.5) to 0 ≤ a and a ≤ b to get a·a ≤ a·b.
Now apply scaling again to 0 ≤ b and a ≤ b to get a·b ≤ b·b. Finally, apply
transitivity (Axiom 4.2.3) to combine a·a ≤ a·b and a·b ≤ b·b to get
a·a ≤ b·b.


C.4 Assignment 4: due Thursday, 2013-10-03, at 
5:00 pm 


C.4.1 A fast-growing function 
Let f : N → N be defined by

\[
\begin{aligned}
f(0) &= 2, \\
f(n+1) &= f(n) \cdot f(n) - 1.
\end{aligned}
\]

Show that f(n) > 2^n for all n ∈ N.


Solution 


The proof is by induction on n, but we have to be a little careful for small
values. We’ll treat n = 0 and n = 1 as special cases, and start the induction
at 2.

For n = 0, we have f(0) = 2 > 1 = 2^0.

For n = 1, we have f(1) = f(0)·f(0) − 1 = 2·2 − 1 = 3 > 2 = 2^1.

For n = 2, we have f(2) = f(1)·f(1) − 1 = 3·3 − 1 = 8 > 4 = 2^2.

For the induction step, we want to show that, for all n ≥ 2, if f(n) > 2^n,
then f(n+1) = f(n)·f(n) − 1 > 2^{n+1}. Compute

\[
\begin{aligned}
f(n+1) &= f(n) \cdot f(n) - 1 \\
&> 2^n \cdot 2^n - 1 \\
&\ge 2^n \cdot 4 - 1 \\
&= 2^{n+1} + 2^{n+1} - 1 \\
&> 2^{n+1}.
\end{aligned}
\]

The principle of induction gives us that f(n) > 2^n for all n ≥ 2, and
we’ve already covered n = 0 and n = 1 as special cases, so f(n) > 2^n for all
n ∈ N.
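
A quick numerical check of the first few values makes the gap between f(n) and 2^n visible. This is only a sanity check under the definition f(0) = 2, f(n+1) = f(n)·f(n) − 1 used in the solution, not a substitute for the induction.

```python
def f(n):
    """Compute f(n) where f(0) = 2 and f(n+1) = f(n) * f(n) - 1."""
    value = 2
    for _ in range(n):
        value = value * value - 1
    return value


# First few values: 2, 3, 8, 63, 3968.
assert [f(n) for n in range(5)] == [2, 3, 8, 63, 3968]

# f(n) > 2^n for every n we try; Python integers do not overflow.
assert all(f(n) > 2 ** n for n in range(10))
```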


C.4.2 A slow-growing set


Let

\[
\begin{aligned}
A_0 &= \{3, 4, 5\}, \\
A_{n+1} &= A_n \cup \Bigl\{ \sum_{x \in A_n} x \Bigr\}.
\end{aligned}
\]

Give a closed-form expression for S_n = ∑_{x∈A_n} x. Justify your answer.


Clarification added 2013-10-01 For the purpose of this problem, you
may assume that ∑_{x∈A} x + ∑_{x∈B} x = ∑_{x∈A∪B} x if A and B are disjoint.
(This is provable for finite sets in a straightforward way using induction on
the size of B, but the proof is pretty tedious.)


Solution 
Looking at the first couple of values, we see:

    S_0 = 3 + 4 + 5 = 12
    S_1 = 3 + 4 + 5 + 12 = 24
    S_2 = 3 + 4 + 5 + 12 + 24 = 48

It’s pretty clear that the sum is doubling at each step. This suggests a
reasonable guess would be ∑_{x∈A_n} x = 12·2^n, which we’ve shown works for
n = 0.

For the induction step, we need to show that when constructing A_{n+1} =
A_n ∪ {S_n}, we are in fact doubling the sum. There is a tiny trick here in
that we have to be careful that S_n isn’t already an element of A_n.


Lemma C.4.1. For all n, S_n ∉ A_n.


Proof. First, we’ll show by induction that |A_n| > 1 and that every element
of A_n is positive.

For the first part, |A_0| = 3 > 1, and by construction A_{n+1} ⊇ A_n. It
follows that A_n ⊇ A_0 for all n, and so |A_n| ≥ |A_0| > 1 for all n.

For the second part, every element of A_0 is positive, and if every element
of A_n is positive, then so is S_n = ∑_{x∈A_n} x. Since each element x of A_{n+1} is
either an element of A_n or equal to S_n, it must be positive as well.

Now suppose S_n ∈ A_n. Then S_n = S_n + ∑_{x∈A_n\{S_n}} x, but the sum is
a sum of at least one positive value, so we get S_n > S_n, a contradiction. It
follows that S_n ∉ A_n.

Having determined that S_n ∉ A_n, we can compute, under the assumption
that S_n = 12·2^n,

\[
\begin{aligned}
S_{n+1} &= \sum_{x \in A_{n+1}} x \\
&= \sum_{x \in A_n} x + S_n \\
&= S_n + S_n \\
&= 12 \cdot 2^n + 12 \cdot 2^n \\
&= 12 \cdot (2^n + 2^n) \\
&= 12 \cdot 2^{n+1}.
\end{aligned}
\]

This completes the induction argument and the proof.
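
The recurrence is easy to simulate with Python sets, which also lets us watch Lemma C.4.1 hold at each step. This is an illustrative sketch; the function name sums is hypothetical.

```python
def sums(steps):
    """Return [S_0, S_1, ..., S_{steps-1}] for A_0 = {3, 4, 5}."""
    a = {3, 4, 5}
    result = []
    for _ in range(steps):
        s = sum(a)
        result.append(s)
        # Lemma C.4.1: the sum is never already in the set, so adding it
        # really does grow the set by one new element.
        assert s not in a
        a = a | {s}
    return result


# The sums double at every step: 12, 24, 48, 96, ...
assert sums(6) == [12 * 2 ** n for n in range(6)]
```

If S_n were already in A_n, the set union would leave the set unchanged and the doubling would fail, which is why the lemma is needed.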


C.4.3 Double factorials 


Recall that the factorial of n, written n!, is defined by

\[
n! = \prod_{i=1}^{n} i. \tag{C.4.1}
\]

The double factorial of n, written n!!, is defined by

\[
n!! = \prod_{i=0}^{\lfloor (n-1)/2 \rfloor} (n - 2i). \tag{C.4.2}
\]

For even n, this expands to n·(n−2)·(n−4)·…·2. For odd n, it expands
to n·(n−2)·(n−4)·…·1.


Show that there exists some n_0, such that for all n in N with n ≥ n_0,

    (2n)!! ≤ (n!)^2.


Solution 


First let’s figure out what n_0 has to be.


We have

    (2·0)!! = 1                    (0!)^2 = 1·1 = 1
    (2·1)!! = 2                    (1!)^2 = 1·1 = 1
    (2·2)!! = 4·2 = 8              (2!)^2 = 2·2 = 4
    (2·3)!! = 6·4·2 = 48           (3!)^2 = 6·6 = 36
    (2·4)!! = 8·6·4·2 = 384        (4!)^2 = 24·24 = 576          (C.4.3)

So we might reasonably guess n_0 = 4 works.

Let’s show by induction that (2n)!! ≤ (n!)^2 for all n ≥ 4. We’ve already
done the base case.

For the induction step, it will be helpful to simplify the expression for
(2n)!!:

\[
\begin{aligned}
(2n)!! &= \prod_{i=0}^{\lfloor (2n-1)/2 \rfloor} (2n - 2i) \\
&= \prod_{i=0}^{n-1} (2n - 2i) \\
&= \prod_{i=1}^{n} 2i.
\end{aligned}
\]

(The last step does a change of variables.)
Now we can compute, assuming n ≥ 4 and that the induction hypothesis
holds for n:

\[
\begin{aligned}
(2(n+1))!! &= \prod_{i=1}^{n+1} 2i \\
&= \left( \prod_{i=1}^{n} 2i \right) \cdot 2(n+1) \\
&= ((2n)!!) \cdot 2 \cdot (n+1) \\
&\le (n!)^2 \cdot (n+1) \cdot (n+1) \\
&= (n! \cdot (n+1))^2 \\
&= \left( \prod_{i=1}^{n+1} i \right)^2 \\
&= ((n+1)!)^2.
\end{aligned}
\]

C.5 Assignment 5: due Thursday, 2013-10-10, at 
5:00 pm 


C.5.1 A bouncy function 
Let f : N → N be defined by

\[
f(n) = \begin{cases} 1 & \text{if } n \text{ is odd, and} \\ n & \text{if } n \text{ is even.} \end{cases}
\]

1. Prove or disprove: f(n) is O(n).


2. Prove or disprove: f(n) is Ω(n).


Solution 


1. Proof: Recall that f(n) is O(n) if there exist constants c > 0 and N,
such that |f(n)| ≤ c·|n| for n ≥ N. Let c = 1 and N = 1. For any
n ≥ 1, either (a) f(n) = 1 ≤ 1·n, or (b) f(n) = n ≤ 1·n. So the
definition is satisfied and f(n) is O(n).

2. Disproof: To show that f(n) is not Ω(n), we need to show that for any
choice of c > 0 and N, there exists some n ≥ N with |f(n)| < c·|n|.


Fix c and N. Let n be the smallest odd number greater than max(1/c, N)
(such a number exists by the well-ordering principle). Then n > N,
and since n is odd, we have f(n) = 1. But c·n > c·max(1/c, N) ≥
c·(1/c) = 1. So c·n > f(n), concluding the disproof.
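
The disproof is constructive: given any c > 0 and N it names a specific n. The sketch below computes that witness with exact rational arithmetic; the function name witness is illustrative, and c is assumed to be passed as a Fraction to avoid rounding.

```python
from fractions import Fraction
from math import floor


def f(n):
    """The bouncy function: 1 on odd inputs, n on even inputs."""
    return 1 if n % 2 == 1 else n


def witness(c, N):
    """Smallest odd n greater than max(1/c, N), as chosen in the disproof."""
    bound = max(1 / Fraction(c), Fraction(N))
    n = floor(bound) + 1   # smallest integer strictly greater than bound
    if n % 2 == 0:
        n += 1             # step up to the next odd number
    return n


c, N = Fraction(1, 100), 5
n = witness(c, N)
assert n == 101
assert n >= N and f(n) < c * n   # f(n) = 1 < 101/100
```

However small c is made, the witness simply moves further out, so no constant c can satisfy the definition of Ω(n).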

C.5.2 Least common multiples of greatest common divisors 


Prove or disprove: For all a, b, c ∈ Z⁺,

    lcm(a, gcd(b, c)) | gcd(lcm(a, b), lcm(a, c)).


Solution 

Proof: Write r for the right-hand side. Observe that

    a | lcm(a, b), and
    a | lcm(a, c), so
    a | gcd(lcm(a, b), lcm(a, c)) = r.

Similarly

    gcd(b, c) | b, implying
    gcd(b, c) | lcm(a, b), and
    gcd(b, c) | c, implying
    gcd(b, c) | lcm(a, c), which together give
    gcd(b, c) | gcd(lcm(a, b), lcm(a, c)) = r.

Since a | r and gcd(b, c) | r, from the definition of lcm we get lcm(a, gcd(b, c)) | r.
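
The divisibility claim can be spot-checked exhaustively over a small range. This is a sanity check rather than a proof; the local lcm helper is defined from math.gcd, and claim_holds is a hypothetical name.

```python
from math import gcd


def lcm(x, y):
    """Least common multiple of two positive integers."""
    return x * y // gcd(x, y)


def claim_holds(a, b, c):
    """Check lcm(a, gcd(b, c)) | gcd(lcm(a, b), lcm(a, c))."""
    left = lcm(a, gcd(b, c))
    right = gcd(lcm(a, b), lcm(a, c))
    return right % left == 0


assert all(
    claim_holds(a, b, c)
    for a in range(1, 21)
    for b in range(1, 21)
    for c in range(1, 21)
)
```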


C.5.3 Adding and subtracting

Let a, b ∈ N with 0 < a < b.


1. Prove or disprove: gcd(a, b) = gcd(b − a, b).


2. Prove or disprove: lcm(a, b) = lcm(a, b − a).


Solution 


1. Proof: Let g = gcd(a, b). Then g | a and g | b, so g | (b − a) as well. So
g is a common divisor of b − a and b. To show that it is the greatest
common divisor, let h | b and h | (b − a). Then h | a since a = b − (b − a).
It follows that h | gcd(a, b), which is g.


2. Disproof: Let a = 2 and b = 5. Then lcm(2, 5) = 10 but lcm(2, 5 − 2) =
lcm(2, 3) = 6 ≠ 10.


C.6 Assignment 6: due Thursday, 2013-10-31, at 
5:00 pm 


C.6.1 Factorials mod n 


Let n ∈ N. Recall that n! = ∏_{i=1}^{n} i. Show that, if n is composite and n > 9,
then

    ⌊n/2⌋! ≡ 0 (mod n).


Solution 


Let n be composite. Then there exist natural numbers a, b ≥ 2 such that
n = ab. Assume without loss of generality that a ≤ b.

For convenience, let k = ⌊n/2⌋. Since b = n/a and a ≥ 2, b ≤ n/2; but b
is an integer, so b ≤ n/2 implies b ≤ ⌊n/2⌋ = k. It follows that both a and b
are at most k.

We now consider two cases:


1. If a ≠ b, then both a and b appear as factors in k!. So k! =
ab ∏_{1≤i≤k, i∉{a,b}} i, giving ab | k!, which means n | k! and k! ≡ 0
(mod n).


2. If a = b, then n = a^2. Since n > 9, we have a > 3, which means a ≥ 4
since a is a natural number. It follows that n ≥ 4a and k ≥ 2a. So a and
2a both appear in the product expansion of k!, giving k! mod 2a^2 = 0.
But then k! mod n = k! mod a^2 = (k! mod 2a^2) mod a^2 = 0.

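
A direct computation shows why the hypothesis n > 9 is needed: among small composite numbers, the claim fails exactly at 4 and 9. This sketch uses hypothetical helper names and only checks n below 300.

```python
from math import factorial, isqrt


def is_composite(n):
    """True if n has a divisor d with 2 <= d <= sqrt(n)."""
    return n > 3 and any(n % d == 0 for d in range(2, isqrt(n) + 1))


def half_factorial_vanishes(n):
    """True if floor(n/2)! is congruent to 0 mod n."""
    return factorial(n // 2) % n == 0


exceptions = [
    n for n in range(4, 300)
    if is_composite(n) and not half_factorial_vanishes(n)
]
assert exceptions == [4, 9]
```

Both exceptions are squares a^2 with a < 4, which is the case where 2a does not fit below k = ⌊n/2⌋.
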
C.6.2 Indivisible and divisible 


Prove or disprove: If A and B are non-empty, finite, disjoint sets of prime 
numbers, then there is a natural number n such that 


    n mod a = 1

for every a in A, and

    n mod b = 0

for every b in B.


Solution 


Proof: Let m_1 = ∏_{a∈A} a and m_2 = ∏_{b∈B} b. Because A and B are disjoint,
m_1 and m_2 have no common prime factors, and gcd(m_1, m_2) = 1. So by the
Chinese Remainder Theorem, there exists some n with 0 ≤ n < m_1 m_2 such
that

    n mod m_1 = 1,
    n mod m_2 = 0.

Then n mod a = (n mod m_1) mod a = 1 mod a = 1 for any a in A, and
similarly n mod b = (n mod m_2) mod b = 0 mod b = 0 for any b in B.

(It’s also possible to do this by applying the more general version of the 
CRT directly, since each pair of elements of A and B are relatively prime.) 
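
The n promised by the Chinese Remainder Theorem can be written down explicitly using a modular inverse. The sketch below is one way to do this (it is not the method in the notes, which only asserts existence); it assumes Python 3.8 or later for math.prod and three-argument pow with exponent −1, and solve is a hypothetical name.

```python
from math import prod


def solve(A, B):
    """Return n with n mod a == 1 for a in A and n mod b == 0 for b in B.

    Assumes A and B are non-empty, finite, disjoint sets of primes.
    """
    m1, m2 = prod(A), prod(B)
    # m2 * (m2^{-1} mod m1) is a multiple of m2 and is 1 modulo m1.
    inverse = pow(m2, -1, m1)
    return (m2 * inverse) % (m1 * m2)


A, B = {2, 3}, {5, 7}
n = solve(A, B)
assert all(n % a == 1 for a in A)
assert all(n % b == 0 for b in B)
```

For A = {2, 3} and B = {5, 7} this gives n = 175, which lies in the range 0 ≤ n < m_1 m_2 = 210.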


C.6.3 Equivalence relations 


Let A be a set, and let R and S be relations on A. Let T be a relation on A
defined by x T y if and only if x R y and x S y.

Prove or disprove: If R and S are equivalence relations, then T is also
an equivalence relation.


Solution 


Proof: The direct approach is to show that T is reflexive, symmetric, and
transitive:

1. Reflexive: For any x, x R x and x S x, so x T x.


2. Symmetric: Suppose x T y. Then x R y and x S y. Since R and S are
symmetric, y R x and y S x. But then y T x.


3. Transitive: Let x T y and y T z. Then x R y and y R z implies x R z, and
similarly x S y and y S z implies x S z. So x R z and x S z, giving x T z.


Alternative proof: It’s also possible to show this using one of the alter-
native characterizations of an equivalence relation from Theorem 9.4.1.
Since R and S are equivalence relations, there exist sets B and C and
functions f : A → B and g : A → C such that x R y if and only if f(x) = f(y)
and x S y if and only if g(x) = g(y). Now consider the function h : A → B × C
defined by h(x) = (f(x), g(x)). Then h(x) = h(y) if and only if (f(x), g(x)) =
(f(y), g(y)), which holds if and only if f(x) = f(y) and g(x) = g(y). But
this last condition holds if and only if x R y and x S y, the definition of x T y.
So we have h(x) = h(y) if and only if x T y, and T is an equivalence relation.


C.7 Assignment 7: due Thursday, 2013-11-07, at 
5:00 pm 


C.7.1 Flipping lattices with a function 


Prove or disprove: For all lattices S and T, and all functions f : S → T, if

    f(x ∨ y) = f(x) ∧ f(y) for all x, y ∈ S,

then

    x ≤ y → f(y) ≤ f(x) for all x, y ∈ S.


Solution 


Let S, T, f be such that f(x ∨ y) = f(x) ∧ f(y) for all x, y ∈ S.

Now suppose that we are given some x, y ∈ S with x ≤ y.

Recall that x ∨ y is the minimum z greater than or equal to both x
and y; so when x ≤ y, y ≥ x and y ≥ y, and for any z with z ≥ x
and z ≥ y, z ≥ y, and y = x ∨ y. From the assumption on f we have
f(y) = f(x ∨ y) = f(x) ∧ f(y).

Now use the fact that f(x) ∧ f(y) is less than or equal to both f(x) and
f(y) to get f(y) = f(x) ∧ f(y) ≤ f(x).




C.7.2 Splitting graphs with a mountain 


Recall that a graph G = (V, E) is bipartite if V can be partitioned into
disjoint sets L and R such that every edge uv has one endpoint in L and the
other in R.

A graph homomorphism f : G → G′ from a graph G = (V, E) to a
graph G′ = (V′, E′) is a function f : V → V′ such that, for every uv ∈ E,
f(u)f(v) ∈ E′.

Prove or disprove: A graph G is bipartite if and only if there exists a
graph homomorphism f : G → K_2.


Solution 


Denote the vertices of K_2 by ℓ and r.

If G is bipartite, let L, R be a partition of V such that every edge has
one endpoint in L and one in R, and let f(x) = ℓ if x is in L and f(x) = r if
x is in R.

Then if uv ∈ E, either u ∈ L and v ∈ R or vice versa; in either case,
f(u)f(v) = ℓr ∈ K_2.

Conversely, suppose f : V → {ℓ, r} is a homomorphism. Define L =
f⁻¹(ℓ) and R = f⁻¹(r); then L, R partition V. Furthermore, for any edge
uv ∈ E, because f(u)f(v) must be the unique edge ℓr, either f(u) = ℓ and
f(v) = r or vice versa. In either case, one of u, v is in L and the other is in
R, so G is bipartite.


C.7.3. Drawing stars with modular arithmetic 


For each pair of natural numbers m and k with m ≥ 2 and 0 < k < m, let
S_{m,k} be the graph whose vertices are the m elements of Z_m and whose edges
consist of all pairs (i, i + k), where the addition is performed mod m. Some
examples are given in Figure C.1.

Give a simple rule for determining, based on m and k, whether or not
S_{m,k} is connected, and prove that your rule works.


Solution 


The rule is that S_{m,k} is connected if and only if gcd(m, k) = 1.

To show that this is the case, consider the connected component that 
contains 0; in other words, the set of all nodes v for which there is a path 
from 0 to v. 


Figure C.1: Examples of S_{m,k} for Problem C.7.3


Lemma C.7.1. There is a path from 0 to v in S_{m,k} if and only if there is
a number a such that v ≡ ak (mod m).


Proof. To show that a exists when a path exists, we’ll do induction on the
length of the path. If the path has length 0, then v ≡ 0 ≡ 0·k (mod m). If
the path has length n > 0, let u be the last vertex on the path before v. By
the induction hypothesis, u ≡ bk (mod m) for some b. There is an edge from
u to v if and only if v ≡ u ± k (mod m). So v ≡ bk ± k ≡ (b ± 1)k (mod m).

Conversely, if there is some a such that v ≡ ak (mod m), then there is a
path 0, k, ..., ak from 0 to v in S_{m,k}.


Now suppose gcd(m, k) = 1. Then k has a multiplicative inverse mod
m, so for any vertex v, letting a = k⁻¹v (mod m) gives an a for which
ak ≡ k⁻¹vk ≡ v (mod m). So in this case, there is a path from 0 to every
vertex in S_{m,k}, showing that S_{m,k} is connected.

Alternatively, suppose that gcd(m, k) ≠ 1. Let g = gcd(m, k). Then
if v ≡ ak (mod m), v = ak − qm for some q, and since g divides both k
and m it also divides ak − qm and thus v. So there is no path from 0 to 1,
since g > 1 implies g does not divide 1. This gives at least two connected
components, and S_{m,k} is not connected.
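
The rule can be tested by building each S_{m,k} implicitly and searching from vertex 0. The sketch below uses a simple stack-based graph search, which is a standard technique chosen here for illustration and is not taken from the notes; is_connected is a hypothetical name.

```python
from math import gcd


def is_connected(m, k):
    """Search S_{m,k} from vertex 0 and report whether all m vertices are reached."""
    seen = {0}
    frontier = [0]
    while frontier:
        u = frontier.pop()
        # Each vertex u is adjacent to u + k and u - k (mod m).
        for v in ((u + k) % m, (u - k) % m):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return len(seen) == m


# Connectivity agrees with gcd(m, k) == 1 for every small case tried.
assert all(
    is_connected(m, k) == (gcd(m, k) == 1)
    for m in range(2, 40)
    for k in range(1, m)
)
```

For instance, S_{6,2} splits into the even and odd residues, while S_{5,2} is a single five-pointed star.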


C.8 Assignment 8: due Thursday, 2013-11-14, at
5:00 pm 


C.8.1 Two-path graphs 


Define a two-path graph to be a graph consisting of exactly two disjoint 
paths, each containing one or more nodes. Given a particular vertex set 
of size n, we can consider the set of all two-path graphs on those vertices. 
For example, when n = 2, there is exactly one two-path graph, with one 
vertex in each length-0 path. When n = 3, there are three: each puts one 
vertex in a path by itself and the other two in a length-1 path. For larger n, 
the number of two-path graphs on a given set of n vertices grows quickly. 
For example, there are 15 two-path graphs on four vertices and 90 two-path 
graphs on five vertices (see Figure C.2). 

Let n > 3. How many two-path graphs are there on n vertices? 

Give a closed-form expression, and justify your answer. 


Solution 


First let’s count how many two-path graphs we get when one path has size k
and the other n − k; to avoid duplication, we’ll insist k ≤ n − k.

Having fixed k, we can specify a pair of paths by giving a permutation
v_1 ... v_n of the vertices; the first path consists of v_1 ... v_k, while the second
consists of v_{k+1} ... v_n. This might appear to give us n! pairs of paths for
each fixed k. However, this may overcount the actual number of paths:


• If k > 1, then we count the same path twice: once as v_1 ... v_k, and
once as v_k ... v_1. So we have to divide by 2 to compensate for this.


• The same thing happens when n − k > 1; in this case, we also have to
divide by 2.


• Finally, if k = n − k, then we count the same pair of paths twice, since
v_1 ... v_k, v_{k+1} ... v_n gives the same graph as v_{k+1} ... v_n, v_1 ... v_k. So
here we must again divide by 2.


For odd n, the last case doesn’t come up. So we get n!/2 graphs
when k = 1 and n!/4 graphs for each larger value of k. For even n,
we get n!/2 graphs when k = 1, n!/4 graphs when 1 < k < n/2, and n!/8
graphs when k = n/2. Adding up the cases gives a total of

\[
n! \left( \frac{1}{2} + \frac{1}{4} \left( \frac{n-1}{2} - 1 \right) \right) = n! \cdot \frac{n+1}{8}
\]

when n is odd, and

\[
n! \left( \frac{1}{2} + \frac{1}{4} \left( \frac{n}{2} - 2 \right) + \frac{1}{8} \right) = n! \cdot \frac{n+1}{8}
\]

when n is even.


So we get the same expression in each case. We can simplify this further
to get

\[
\frac{(n+1)!}{8} \tag{C.8.1}
\]

two-path graphs on n ≥ 3 vertices.

The simplicity of (C.8.1) suggests that there ought to be a combinatorial 
proof of this result, where we take a two-path graph and three bits of 
additional information and bijectively construct a permutation of n + 1 
values. 

Here is one such construction, which maps the set of all two-path graphs 
with vertices in [n] plus three bits to the set of all permutations on [n + 1]. 
The basic idea is to paste the two paths together in some order with n
between them, with some special handling of one-element paths to cover
permutations that put n at one end or the other. Miraculously, this special
handling exactly compensates for the fact that one-element paths have no
sense of direction.


1. For any two-path graph, we can order the two components on which 
contains 0 and which doesn’t. Similarly, we can order each path by 
starting with its smaller endpoint. 


2. To construct a permutation on [n + 1], use one bit to choose the order 
of the two components. If both components have two or more elements, 
use two bits to choose whether to include them in their original order or 
the reverse, and put n between the two components. If one component 
has only one element x, use its bit instead to determine whether we 
include x,n or n,x in our permutation. 


In either case we can reconstruct the original two-path graph uniquely by 
splitting the permutation at n, or by splitting off the immediate neighbor of n 
if n is an endpoint; this shows that the construction is surjective. Furthermore 
changing any of the three bits changes the permutation we get; together with 
the observation that we can recover the two-path graph, this shows that the 
construction is also injective. So we have that the number of permutations
on n + 1 values is 2^3 = 8 times the number of two-path graphs on n vertices,
giving (n + 1)!/8 two-path graphs as claimed.

(For example, if our components are 0,1 and 2,3,4, and the bits are 101,
the resulting permutation is 4,3,2,5,0,1. If the components are instead 3 and
2,0,4,1, and the bits are 011, then we get 5,3,1,4,0,2. In either case we
can recover the original two-path graph by deleting 5 and splitting according
to the rule.)

Both of these proofs are pretty tricky. The brute-force counting approach 
may be less prone to error, and the combinatorial proof probably wouldn’t 
occur to anybody who hadn’t already seen the answer. 


C.8.2 Even teams 


A group of sports astronomers on Gliese 667 Cc are trying to reconstruct 
American football. Based on sketchy radio transmissions from Earth, they 
have so far determined that it (a) involves ritualized violence, (b) sometimes 
involves tailgate parties, and (c) works best with even teams. Unfortunately, 
they are still confused about what even teams means. 

Suppose that you have n candidate football players, that you pick k of 
them, and then split the k players into two teams (crimson and blue, say). 
As a function of n, how many different ways are there to do this with k being 
an even number? 

For example, when n = 2, there are five possibilities for the pairs of 
teams: (∅, ∅), (∅, {a, b}), ({a}, {b}), ({b}, {a}), and ({a, b}, ∅). 

(Hint: Consider the difference between the number of ways to make k 
even and the number of ways to make k odd.) 


Clarification added 2013-11-13: Ideally, your answer should be in closed 
form. 


Solution 


We’ll take the hint, and let E(n) be the number of team assignments that 
make k even and U(n) be the number that make k uneven, or odd. Each of the 
k chosen players can go on either of two teams, so there are 
$\binom{n}{k} 2^k$ assignments for each k, and we can compute 

\[
E(n) - U(n) = \sum_{k=0}^{n} \binom{n}{k} (-1)^k 2^k = \sum_{k=0}^{n} \binom{n}{k} (-2)^k = (1 + (-2))^n = (-1)^n.
\]

We also have $E(n) + U(n) = \sum_{k=0}^{n} \binom{n}{k} 2^k = (1 + 2)^n = 3^n$. Solving for 
E(n) gives 

\[
E(n) = \frac{3^n + (-1)^n}{2}. \tag{C.8.2}
\]

To make sure that we didn’t make any mistakes, it may be helpful to 
check a few small cases. For n = 0, we have one even split (nobody on 
either team), and $(3^0 + (-1)^0)/2 = 2/2 = 1$. For n = 1, we have the 
same even split, and $(3^1 + (-1)^1)/2 = (3 - 1)/2 = 1$. For n = 2, we get 
five even splits ((∅, ∅), ({x}, {y}), ({y}, {x}), ({x, y}, ∅), (∅, {x, y})), and 
$(3^2 + (-1)^2)/2 = (9 + 1)/2 = 5$. This is not a proof that (C.8.2) will keep 
working forever, but it does suggest that we didn’t screw up in some obvious 
way. 
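The small-case check can be pushed a little further by brute force. The sketch below is only an illustration (the function names are invented here, not taken from the notes): it enumerates every way to leave each player out or put them on one of the two teams, keeps the assignments with an even number of picked players, and compares the count with $(3^n + (-1)^n)/2$.

```python
from itertools import product


def count_even_splits(n: int) -> int:
    """Brute-force count of team assignments with an even number of picked players."""
    count = 0
    # Each player is either left out, put on crimson, or put on blue.
    for assignment in product(("out", "crimson", "blue"), repeat=n):
        picked = sum(1 for role in assignment if role != "out")
        if picked % 2 == 0:
            count += 1
    return count


def even_splits_formula(n: int) -> int:
    """Closed form (3^n + (-1)^n) / 2; the numerator is always even."""
    return (3**n + (-1) ** n) // 2


if __name__ == "__main__":
    for n in range(8):
        print(n, count_even_splits(n), even_splits_formula(n))
```

Like the hand check, agreement on small n is evidence rather than proof; the binomial-theorem argument is what establishes the formula for all n.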


C.8.3 Inflected sequences 


Call a sequence of three natural numbers $a_0, a_1, a_2$ inflected if $a_0 \ge a_1 \le a_2$ 
or $a_0 \le a_1 \ge a_2$. 

As a function of n, how many such inflected sequences are there with 
$a_0, a_1, a_2 \in [n]$? 


Solution 


Let S be the set of triples $(a_0, a_1, a_2)$ in $[n]^3$ with $a_0 \ge a_1 \le a_2$ and let T 
be the set of triples with $a_0 \le a_1 \ge a_2$. Replacing each $a_i$ with $(n - 1) - a_i$ 
gives a bijection between S and T, so |S| = |T|. Computing |T| is a little 
easier, so we’ll do that first. 

To compute |T|, pick $a_1$ first. Then $a_0$ and $a_2$ can be any elements of 
[n] that are less than or equal to $a_1$. Summing over all possible values of $a_1$ 
gives 

\[
\sum_{i=0}^{n-1} (i+1)^2 = \sum_{i=1}^{n} i^2 = \frac{1}{3} n^3 + \frac{1}{2} n^2 + \frac{1}{6} n.
\]

The last step uses (6.4.2). 

The number we want is $|S \cup T| = |S| + |T| - |S \cap T|$. For a triple to be 
in $S \cap T$, we must have $a_0 = a_1 = a_2$; there are n such triples. So we have 

\[
|S \cup T| = 2\left(\frac{1}{3} n^3 + \frac{1}{2} n^2 + \frac{1}{6} n\right) - n = \frac{2}{3} n^3 + n^2 - \frac{2}{3} n.
\]
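As a sanity check on the inclusion-exclusion count, here is a small brute-force sketch. It assumes [n] means {0, ..., n − 1} and that both inequalities are non-strict, as in the solution above; the function names are illustrative only.

```python
from itertools import product


def count_inflected(n: int) -> int:
    """Count triples in {0, ..., n-1}^3 that dip or peak (non-strictly) in the middle."""
    return sum(
        1
        for a0, a1, a2 in product(range(n), repeat=3)
        if (a0 >= a1 <= a2) or (a0 <= a1 >= a2)
    )


def inflected_formula(n: int) -> int:
    """(2/3) n^3 + n^2 - (2/3) n, written over the common denominator 3."""
    return (2 * n**3 + 3 * n**2 - 2 * n) // 3


if __name__ == "__main__":
    for n in range(8):
        print(n, count_inflected(n), inflected_formula(n))
```

For n = 3 the only triples that are not inflected are the two strictly monotone ones, (0, 1, 2) and (2, 1, 0), which matches 27 − 2 = 25.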


C.9 Assignment 9: due Thursday, 2013-11-21, at 
5:00 pm 


For problems that ask you to compute a value, closed-form expressions are 
preferred, and you should justify your answers. 


C.9.1 Guessing the median 


The median of a set S of n distinct numbers, when n is odd, is the element 
x of S with the property that exactly (n − 1)/2 elements of S are less than x 
(which also implies that exactly (n − 1)/2 elements of S are greater than x). 

Consider the following speedy but inaccurate algorithm for guessing the 
median of a set S, when n = |S| is odd and n ≥ 3: choose a three-element 
subset R of S uniformly at random, then return the median of R. What is 
the probability, as a function of n, that the median of R is in fact the median 
of S? 


Solution 


There are $\binom{n}{3} = \frac{n(n-1)(n-2)}{6}$ choices for R, all of which are equally likely. So 
we want to count the number of sets R for which median(R) = median(S). 
Each such set contains median(S), one of the (n − 1)/2 elements of S 
less than median(S), and one of the (n − 1)/2 elements of S greater than 
median(S). So there are $\frac{(n-1)^2}{4}$ choices of R that cause the algorithm to 
work. The probability of picking one of these good sets is 

\[
\frac{(n-1)^2/4}{n(n-1)(n-2)/6} = \frac{3}{2} \cdot \frac{n-1}{n(n-2)}.
\]

As a quick test, when n = 3, this evaluates to $\frac{3}{2} \cdot \frac{2}{3 \cdot 1} = 1$, which is what 
we’d expect given that there is only one three-element subset of S in this 
case. This is also a case where the algorithm works much better than the 
even dumber algorithm of just picking a single element of S at random, 
which succeeds with probability 1/n, or 1/3 in this case. For larger n the 
performance of median-of-three is less convincing, converging to a $\frac{3}{2} \cdot (1/n)$ 
probability of success in the limit. 
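The probability can also be checked exactly for small odd n by listing every three-element subset. This is a sketch under the assumption that S = {0, ..., n − 1} (any set of n distinct numbers behaves the same way); the names are made up for the illustration.

```python
from fractions import Fraction
from itertools import combinations


def median_of_three_success(n: int) -> Fraction:
    """Exact probability that a random 3-subset of {0..n-1} has the same median as the set."""
    true_median = (n - 1) // 2  # median of {0, ..., n-1} when n is odd
    subsets = list(combinations(range(n), 3))
    # combinations() yields sorted tuples, so the middle entry is the median of R.
    good = sum(1 for r in subsets if r[1] == true_median)
    return Fraction(good, len(subsets))


def median_formula(n: int) -> Fraction:
    """The closed form (3/2) * (n - 1) / (n (n - 2))."""
    return Fraction(3, 2) * Fraction(n - 1, n * (n - 2))


if __name__ == "__main__":
    for n in (3, 5, 7, 9, 11):
        print(n, median_of_three_success(n), median_formula(n))
```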


C.9.2 Two flushes 


A standard poker deck has 52 cards, which are divided into 4 suits of 13 
cards each. Suppose that we shuffle a poker deck so that all 52! permutations 
are equally likely. We then deal the top 5 cards to you and the next 5 cards 
to me. 

Define a flush to be five cards from the same suit. Let A be the event 
that you get a flush, and let B be the event that I get a flush. 


1. What is Pr[B | A]? 


2. Is this more or less than Pr[B]? 


Solution 


Recall that 

\[
\Pr[B \mid A] = \frac{\Pr[B \cap A]}{\Pr[A]}.
\]

Let’s start by calculating Pr[A]. For any single suit s, there are $(13)_5$ 
ways to give you 5 cards from s, out of $(52)_5$ ways to give you 5 cards, 
assuming in both cases that we keep track of the order of the cards.⁴ So the 
event $A_s$ that you get only cards in s has probability 

\[
\frac{(13)_5}{(52)_5}.
\]

³ This turns out to be pretty hard to do in practice [BD92], but we’ll suppose that we 
can actually do it. 

⁴ If we don’t keep track of the order, we get $\binom{13}{5}$ choices out of $\binom{52}{5}$ possibilities; these 
divide out to the same value. 

Since there are four suits, and the events $A_s$ are disjoint for each suit, 
we get 

\[
\Pr[A] = \sum_{s} \Pr[A_s] = 4 \cdot \frac{(13)_5}{(52)_5}.
\]

For Pr[A ∩ B], let $C_{st}$ be the event that your cards are all from suit s 
and mine are all from suit t. Then 

\[
\Pr[C_{st}] =
\begin{cases}
\dfrac{(13)_{10}}{(52)_{10}} & \text{if } s = t, \text{ and} \\[1ex]
\dfrac{(13)_5 \cdot (13)_5}{(52)_{10}} & \text{if } s \ne t.
\end{cases}
\]

Summing up all 16 cases gives 

\[
\Pr[A \cap B] = \frac{4 \cdot (13)_{10} + 12 \cdot (13)_5 \cdot (13)_5}{(52)_{10}}.
\]

Now divide to get 

\[
\begin{aligned}
\Pr[B \mid A]
&= \frac{\left(\dfrac{4 \cdot (13)_{10} + 12 \cdot (13)_5 \cdot (13)_5}{(52)_{10}}\right)}{\left(\dfrac{4 \cdot (13)_5}{(52)_5}\right)} \\
&= \frac{\dfrac{(13)_{10}}{(13)_5} + 3 \cdot (13)_5}{\dfrac{(52)_{10}}{(52)_5}} \\
&= \frac{(8)_5 + 3 \cdot (13)_5}{(47)_5}.
\end{aligned}
\tag{C.9.1}
\]

Another way to get (C.9.1) is to argue that once you have five cards of a 
particular suit, there are $(47)_5$ equally probable choices for my five cards, of 
which $(8)_5$ give me five cards from your suit and $3 \cdot (13)_5$ give me five cards 
from one of the three other suits. 


However we arrive at (C.9.1), we can evaluate it numerically to get 

\[
\begin{aligned}
\Pr[B \mid A] &= \frac{(8)_5 + 3 \cdot (13)_5}{(47)_5} \\
&= \frac{8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 + 3 \cdot (13 \cdot 12 \cdot 11 \cdot 10 \cdot 9)}{47 \cdot 46 \cdot 45 \cdot 44 \cdot 43} \\
&= \frac{6720 + 3 \cdot 154440}{184072680} \\
&= \frac{470040}{184072680} \\
&= \frac{3917}{1533939} \\
&\approx 0.00255356.
\end{aligned}
\]

This turns out to be slightly larger than the probability that I get a flush 
without conditioning, which is 

\[
\begin{aligned}
\Pr[B] &= \frac{4 \cdot (13)_5}{(52)_5} \\
&= \frac{4 \cdot 154440}{311875200} \\
&= \frac{33}{16660} \\
&\approx 0.00198079.
\end{aligned}
\]

So Pr[B | A] is greater than Pr[B]. 

This is a little surprising. If you have a flush, it seems like there should 
be fewer ways for me to make a flush. But what happens is that your 
flush in spades (say) actually makes my flush in a non-spade suit more 
likely—because there are fewer spades to dilute my potential heart, diamond, 
or club flush—and this adds more to my chances than the unlikelihood of 
getting a second flush in spades subtracts. 
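The arithmetic is easy to get wrong by hand, so here is a short exact recomputation with rational numbers. It is only a sketch: `falling` is a helper name invented here for the falling factorial $(n)_k = n(n-1)\cdots(n-k+1)$.

```python
from fractions import Fraction


def falling(n: int, k: int) -> int:
    """Falling factorial (n)_k = n (n-1) ... (n-k+1)."""
    result = 1
    for i in range(k):
        result *= n - i
    return result


# Pr[A]: four suits, ordered 5-card deals.
pr_a = Fraction(4 * falling(13, 5), falling(52, 5))
# Pr[A and B]: 4 same-suit cases plus 12 different-suit cases, ordered 10-card deals.
pr_a_and_b = Fraction(
    4 * falling(13, 10) + 12 * falling(13, 5) ** 2, falling(52, 10)
)
pr_b_given_a = pr_a_and_b / pr_a
pr_b = pr_a  # by symmetry, my hand is as likely to be a flush as yours

if __name__ == "__main__":
    print(pr_b_given_a, float(pr_b_given_a))
    print(pr_b, float(pr_b))
```

Dividing the two exact fractions reproduces the shortcut count of $(8)_5 + 3 \cdot (13)_5$ good hands out of $(47)_5$, and confirms that conditioning on your flush raises my chances.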


C.9.3 Dice and more dice 


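Let n be a positive integer, and let $D_0, D_1, \ldots, D_n$ be independent random 
variables representing n-sided dice. Formally, $\Pr[D_i = k] = 1/n$ for each i 
and each k in {1, ..., n}. 

Let 

\[
S = \sum_{i=1}^{D_0} D_i.
\]

As a function of n, what is E[S]? 


Solution 


Expand 

\[
\begin{aligned}
E[S] &= \sum_{j=1}^{n} \Pr[D_0 = j] \cdot E\left[\sum_{i=1}^{j} D_i\right] \\
&= \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{j} E[D_i] \\
&= \frac{1}{n} \sum_{j=1}^{n} j \cdot \frac{n+1}{2} \\
&= \frac{1}{n} \cdot \frac{n(n+1)}{2} \cdot \frac{n+1}{2} \\
&= \frac{(n+1)^2}{4}.
\end{aligned}
\]

The first step conditions on the value of $D_0$, which is independent of 
$D_1, \ldots, D_n$; the third uses $E[D_i] = (n+1)/2$. 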
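For small n the expectation can be computed exactly by enumerating all $n^{n+1}$ equally likely outcomes of the dice. The sketch below does this and compares the result with $(n+1)^2/4$, the value obtained by conditioning on $D_0$; the function name is illustrative only.

```python
from fractions import Fraction
from itertools import product


def expected_random_sum(n: int) -> Fraction:
    """Exact E[S] for S = D_1 + ... + D_{D_0}, by enumerating all rolls of D_0..D_n."""
    total = 0
    outcomes = 0
    for dice in product(range(1, n + 1), repeat=n + 1):
        d0 = dice[0]                 # D_0 decides how many of the other dice count
        total += sum(dice[1 : d0 + 1])  # D_1 through D_{d0}
        outcomes += 1
    return Fraction(total, outcomes)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, expected_random_sum(n), Fraction((n + 1) ** 2, 4))
```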


Appendix D 


Sample exams from Fall 2013 


These are exams from the Fall 2013 version of CPSC 202. Some older exams 
can be found in Appendices E and F. 


D.1 CS202 Exam 1, October 17th, 2013 


Write your answers on the exam. Justify your answers. Work alone. Do not 
use any notes or books. 

There are four problems on this exam, each worth 20 points, for a total 
of 80 points. You have approximately 75 minutes to complete this exam. 


D.1.1 A tautology (20 points) 


Use a truth table to show that the following is a tautology: 


P ∨ (P ↔ Q) ∨ Q 


Solution 

P   Q   P ↔ Q   P ∨ (P ↔ Q)   P ∨ (P ↔ Q) ∨ Q 
0   0     1          1               1 
0   1     0          0               1 
1   0     0          1               1 
1   1     1          1               1 
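The same table can be generated mechanically. In this sketch the biconditional P ↔ Q is modeled as equality of Python booleans, which is an encoding chosen here rather than anything from the exam.

```python
from itertools import product


def formula(p: bool, q: bool) -> bool:
    """P or (P <-> Q) or Q, with the biconditional written as equality."""
    return p or (p == q) or q


# One row per truth assignment: P, Q, P<->Q, P or (P<->Q), full formula.
rows = [
    (p, q, p == q, p or (p == q), formula(p, q))
    for p, q in product([False, True], repeat=2)
]
is_tautology = all(row[-1] for row in rows)

if __name__ == "__main__":
    for row in rows:
        print(" ".join(str(int(value)) for value in row))
    print("tautology:", is_tautology)
```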


D.1.2 A system of equations (20 points) 

Show that the equations 
\[
\begin{aligned}
x + y &\equiv 0 \pmod{m} \\
x - y &\equiv 0 \pmod{m}
\end{aligned}
\]

have exactly one solution with 0 ≤ x, y < m if and only if m is odd. 


Solution 


Add the equations together to get 

\[
2x \equiv 0 \pmod{m}.
\]

If m is odd, then gcd(m, 2) = 1 and 2 has a multiplicative inverse in 
$\mathbb{Z}_m$ (which happens to be (m + 1)/2). So we can multiply both sides of the 
equation by this inverse to get $x \equiv 0 \pmod{m}$. Having determined x, we 
can then use either of the given equations to compute $y \equiv 0 \pmod{m}$, giving 
the unique solution x = y = 0. 

If m is even, then x = y = 0 and x = y = m/2 are two distinct solutions 
that satisfy the given equations. 
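A direct search over all pairs makes the odd/even split visible. This is a sketch with an invented function name; it simply lists every pair with 0 ≤ x, y < m that satisfies both congruences.

```python
def solutions(m: int) -> list[tuple[int, int]]:
    """All (x, y) with 0 <= x, y < m, x + y = 0 (mod m) and x - y = 0 (mod m)."""
    return [
        (x, y)
        for x in range(m)
        for y in range(m)
        if (x + y) % m == 0 and (x - y) % m == 0
    ]


if __name__ == "__main__":
    for m in range(1, 11):
        print(m, solutions(m))
```

For even m the search finds exactly the two solutions named in the proof, (0, 0) and (m/2, m/2).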


D.1.3 A sum of products (20 points) 


Let $n \in \mathbb{N}$. Give a closed-form expression for 

\[
\sum_{i=1}^{n} \prod_{j=1}^{i} 2.
\]

Justify your answer. 


Solution 


Using the definition of exponentiation and the geometric series formula, we 
can compute 

\[
\begin{aligned}
\sum_{i=1}^{n} \prod_{j=1}^{i} 2 &= \sum_{i=1}^{n} 2^i \\
&= \sum_{i=0}^{n-1} 2^{i+1} \\
&= 2 \sum_{i=0}^{n-1} 2^i \\
&= 2 \cdot \frac{2^n - 1}{2 - 1} \\
&= 2 \cdot (2^n - 1) \\
&= 2^{n+1} - 2.
\end{aligned}
\]


D.1.4 A subset problem (20 points) 
Let A and B be sets. Show that if A ∩ C ⊆ B ∩ C for all sets C, then A ⊆ B. 


Solution 


Suppose that for all C, A ∩ C ⊆ B ∩ C. In particular, let C = A. Then 
A = A ∩ A ⊆ B ∩ A. If x ∈ A, then x ∈ B ∩ A, giving x ∈ B. So A ⊆ B. 
(Other choices for C also work.) 

An alternative proof proceeds by contraposition: Suppose A ⊈ B. Then 
there is some x in A that is not in B. But then A ∩ {x} = {x} and B ∩ {x} = ∅, 
so A ∩ {x} ⊈ B ∩ {x}. 


D.2 CS202 Exam 2, December 4th, 2013 


Write your answers on the exam. Justify your answers. Work alone. Do not 
use any notes or books. 

There are four problems on this exam, each worth 20 points, for a total 
of 80 points. You have approximately 75 minutes to complete this exam. 


D.2.1 Minimum elements (20 points) 


Let (S, ≤) be a partially-ordered set. Recall that for any subset T of S, x is 
a minimum element of T if x is in T and x ≤ y for all y in T. 

Prove or disprove: If every nonempty subset of S has a minimum element, 
then S is totally ordered. 


Solution 


We need to show that any two elements of S are comparable. 

If S is empty, the claim holds vacuously. 

Otherwise, let x and y be elements of S. Then {x, y} is a nonempty 
subset of S, and so it has a minimum element z. If z = x, then x ≤ y; if 
z = y, then y ≤ x. In either case, x and y are comparable. 


D.2.2 Quantifiers (20 points) 


Show that exactly one of the following two statements is true. 


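\[
\forall x \in \mathbb{Z} : \exists y \in \mathbb{Z} : x < y \tag{D.2.1}
\]
\[
\exists x \in \mathbb{Z} : \forall y \in \mathbb{Z} : x < y \tag{D.2.2}
\]


Solution 


First, we’ll show that (D.2.1) is true. Given any $x \in \mathbb{Z}$, choose y = x + 1. 
Then x < y. 

Next, we’ll show that (D.2.2) is not true, by showing that its negation 
is true. Negating (D.2.2) gives $\forall x \in \mathbb{Z} : \exists y \in \mathbb{Z} : x \not< y$. Given any $x \in \mathbb{Z}$, 
choose y = x. Then $x \not< y$. 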


D.2.3 Quadratic matrices (20 points) 


Prove or disprove: For all n x n matrices A and B with real elements, 


\[
(A + B)^2 = A^2 + 2AB + B^2. \tag{D.2.3}
\]


Solution 


We don’t really expect this to be true, because the usual expansion 
$(A + B)^2 = A^2 + AB + BA + B^2$ doesn’t simplify further since AB does not 
equal BA in general. 

In fact, we can use this as the basis for a disproof. Suppose 

\[
(A + B)^2 = A^2 + 2AB + B^2.
\]

Then 

\[
A^2 + 2AB + B^2 = A^2 + AB + BA + B^2.
\]

Subtract $A^2 + AB + B^2$ from both sides to get 

\[
AB = BA.
\]

So $(A + B)^2 = A^2 + 2AB + B^2$ holds only if A and B commute. Since 
there exists at least one pair of square real matrices A and B that don’t 
commute, (D.2.3) does not hold in general. 
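To see the failure concretely, the sketch below multiplies out one non-commuting pair using plain nested lists. The particular matrices A and B are chosen here purely for illustration, and the helper names are not from the notes.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_add(*matrices):
    """Add any number of equally sized matrices entry by entry."""
    return [[sum(entries) for entries in zip(*rows)] for rows in zip(*matrices)]


# Hypothetical example pair; any two matrices with AB != BA would do.
A = [[1, 1], [1, 1]]
B = [[1, -1], [1, 1]]

AB = mat_mul(A, B)
BA = mat_mul(B, A)
square_of_sum = mat_mul(mat_add(A, B), mat_add(A, B))
naive_expansion = mat_add(mat_mul(A, A), AB, AB, mat_mul(B, B))   # A^2 + 2AB + B^2
true_expansion = mat_add(mat_mul(A, A), AB, BA, mat_mul(B, B))    # A^2 + AB + BA + B^2

if __name__ == "__main__":
    print("AB =", AB, " BA =", BA)
    print("(A+B)^2         =", square_of_sum)
    print("A^2+2AB+B^2     =", naive_expansion)
    print("A^2+AB+BA+B^2   =", true_expansion)
```

The expansion that keeps AB and BA separate always agrees with the square of the sum; the version with 2AB agrees only when AB = BA.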

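For a more direct disproof, we can choose any pair of square real matrices 
that don’t commute, and show that they give different values for $(A + B)^2$ 
and $A^2 + 2AB + B^2$. For example, let 

\[
A = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}.
\]

Then 

\[
(A + B)^2 = \begin{bmatrix} 2 & 0 \\ 2 & 2 \end{bmatrix}^2
= \begin{bmatrix} 4 & 0 \\ 8 & 4 \end{bmatrix},
\]

but 

\[
A^2 + 2AB + B^2
= \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}
+ 2 \begin{bmatrix} 2 & 0 \\ 2 & 0 \end{bmatrix}
+ \begin{bmatrix} 0 & -2 \\ 2 & 0 \end{bmatrix}
= \begin{bmatrix} 6 & 0 \\ 8 & 2 \end{bmatrix}.
\]


D.2.4 Low-degree connected graphs (20 points) 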


How many connected graphs contain no vertices with degree greater than 
one? 


Solution 


There are three: The empty graph, the graph with one vertex, and the graph 
with two vertices connected by an edge. These enumerate all connected 
graphs with two vertices or fewer (the other two-vertex graph, with no edge, 
is not connected). 

To show that these are the only possibilities, suppose that we have a 
connected graph G with more than two vertices. Let u be one of these 
vertices. Let v be a neighbor of u (if u has no neighbors, then there is no 
path from u to any other vertex, and G is not connected). Let w be some 
other vertex. Since G is connected, there is a path from u to w. Let w’ be 
the first vertex in this path that is not u or v. Then w’ is adjacent to u or v; 
in either case, one of u or v has degree at least two. 


Appendix E 


Midterm exams from earlier 
semesters 


Note that topics covered may vary from semester to semester, so the 
appearance of a particular topic on one of these sample midterms does not 
necessarily mean that it may appear on a current exam. 


E.1 Midterm Exam, October 12th, 2005 


Write your answers on the exam. Justify your answers. Work alone. Do not 
use any notes or books. 

There are four problems on this exam, each worth 20 points, for a total 
of 80 points. You have approximately 50 minutes to complete this exam. 


E.1.1 A recurrence (20 points) 


Give a simple formula for T(n), where: 


\[
\begin{aligned}
T(0) &= 1 \\
T(n) &= 3T(n-1) + 2^n, \text{ when } n > 0.
\end{aligned}
\]


Solution 


Using generating functions 


Let $F(z) = \sum_{n=0}^{\infty} T(n) z^n$, then 

\[
F(z) = 3zF(z) + \frac{1}{1 - 2z}.
\]

Solving for F gives 

\[
F(z) = \frac{1}{(1 - 2z)(1 - 3z)} = \frac{-2}{1 - 2z} + \frac{3}{1 - 3z}.
\]

From the generating function we can immediately read off 

\[
T(n) = 3 \cdot 3^n - 2 \cdot 2^n = 3^{n+1} - 2^{n+1}.
\]

Without using generating functions 


It is possible to solve this problem without generating functions, but it’s 
harder. Here’s one approach based on forward induction. Start by computing 
the first few values of T(n). We’ll avoid reducing the expressions to make it 
easier to spot a pattern. 


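\[
\begin{aligned}
T(0) &= 1 \\
T(1) &= 3 + 2 \\
T(2) &= 3^2 + 3 \cdot 2 + 2^2 \\
T(3) &= 3^3 + 3^2 \cdot 2 + 3 \cdot 2^2 + 2^3 \\
T(4) &= 3^4 + 3^3 \cdot 2 + 3^2 \cdot 2^2 + 3 \cdot 2^3 + 2^4
\end{aligned}
\]

At this point we might guess that 

\[
T(n) = \sum_{k=0}^{n} 3^{n-k} 2^k = 3^n \sum_{k=0}^{n} (2/3)^k
= 3^n \left( \frac{1 - (2/3)^{n+1}}{1 - (2/3)} \right) = 3^{n+1} - 2^{n+1}.
\]

A guess is not a proof; to prove that this guess works we verify 
$T(0) = 3^1 - 2^1 = 3 - 2 = 1$ and 
$T(n) = 3T(n-1) + 2^n = 3(3^n - 2^n) + 2^n = 3^{n+1} - 2 \cdot 2^n = 3^{n+1} - 2^{n+1}$. 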

E.1.2 An induction proof (20 points) 


Prove by induction on n that $n! > 2^n$ for all integers $n \ge n_0$, where $n_0$ is an 
integer chosen to be as small as possible. 


Solution 


Trying small values of n gives $0! = 1 = 2^0$ (bad), $1! = 1 < 2^1$ (bad), 
$2! = 2 < 2^2$ (bad), $3! = 6 < 2^3$ (bad), $4! = 24 > 2^4 = 16$ (good). So we’ll 
guess $n_0 = 4$ and use the n = 4 case as a basis. 

For larger n, we have $n! = n \cdot (n-1)! > n \cdot 2^{n-1} > 2 \cdot 2^{n-1} = 2^n$. 


E.1.3 Some binomial coefficients (20 points) 


Prove that $k \binom{n}{k} = n \binom{n-1}{k-1}$ when $1 \le k \le n$. 


Solution 


There are several ways to do this. The algebraic version is probably cleanest. 


Combinatorial version 


The LHS counts the ways to choose k of n elements and then specially mark 
one of the k. Alternatively, we could choose the marked element first (n 
choices) and then choose the remaining k − 1 elements from the remaining 
n − 1 elements ($\binom{n-1}{k-1}$ choices); this gives the RHS. 


Algebraic version 


Compute 

\[
k \binom{n}{k} = k \cdot \frac{n!}{k!(n-k)!} = \frac{n!}{(k-1)!(n-k)!}
= n \cdot \frac{(n-1)!}{(k-1)!((n-1)-(k-1))!} = n \binom{n-1}{k-1}.
\]


Generating function version 


Observe that 

\[
\sum_{k=0}^{n} k \binom{n}{k} z^k = z \frac{d}{dz} (1+z)^n = zn(1+z)^{n-1}
= \sum_{k=0}^{n-1} n \binom{n-1}{k} z^{k+1} = \sum_{k=1}^{n} n \binom{n-1}{k-1} z^k.
\]

Now match $z^k$ coefficients to get the desired result. 


E.1.4 A probability problem (20 points) 


Suppose you flip a fair coin n times, where n ≥ 1. What is the probability 
of the event that both of the following hold: (a) the coin comes up heads at 
least once and (b) once it comes up heads, it never comes up tails on any 
later flip? 


Solution 


For each i ∈ {1, ..., n}, let $A_i$ be the event that the coin comes up heads for 
the first time on flip i and continues to come up heads thereafter. Then 
the desired event is the disjoint union of the $A_i$. Since each $A_i$ is a single 
sequence of coin-flips, each occurs with probability $2^{-n}$. Summing over all i 
gives a total probability of $n 2^{-n}$. 
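The count of good sequences can be confirmed by enumerating all $2^n$ flip sequences. In this sketch a sequence is a string over "H" and "T", which is just a convenient encoding chosen here.

```python
from fractions import Fraction
from itertools import product


def heads_then_only_heads(n: int) -> Fraction:
    """Probability that heads appears at least once and no tails follows the first heads."""
    good = 0
    for flips in product("HT", repeat=n):
        sequence = "".join(flips)
        if "H" in sequence and "T" not in sequence[sequence.index("H"):]:
            good += 1
    return Fraction(good, 2**n)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, heads_then_only_heads(n), Fraction(n, 2**n))
```

The good sequences are exactly the n strings of the form T...TH...H with at least one H, one for each position of the first heads.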




E.2 Midterm Exam, October 24th, 2007 


Write your answers on the exam. Justify your answers. Work alone. Do not 
use any notes or books. 

There are four problems on this exam, each worth 20 points, for a total 
of 80 points. You have approximately 50 minutes to complete this exam. 


E.2.1 Dueling recurrences (20 points) 


Let 0 ≤ S(0) ≤ T(0), and suppose we have the recurrences 

\[
\begin{aligned}
S(n+1) &= aS(n) + f(n) \\
T(n+1) &= bT(n) + g(n),
\end{aligned}
\]

where 0 ≤ a ≤ b and 0 ≤ f(n) ≤ g(n) for all n ∈ N. 
Prove that S(n) ≤ T(n) for all n ∈ N. 


Solution 


We’ll show the slightly stronger statement 0 ≤ S(n) ≤ T(n) by induction on 
n. The base case n = 0 is given. 

Now suppose 0 ≤ S(n) ≤ T(n); we will show the same holds for n + 1. 
First observe S(n + 1) = aS(n) + f(n) ≥ 0 as each variable on the right-hand 
side is non-negative. To show T(n + 1) ≥ S(n + 1), observe 

\[
\begin{aligned}
T(n+1) &= bT(n) + g(n) \\
&\ge aT(n) + f(n) \\
&\ge aS(n) + f(n) \\
&= S(n+1).
\end{aligned}
\]

Note that we use the fact that 0 ≤ T(n) (from the induction hypothesis) 
in the first step and 0 ≤ a in the second. The claim does not go through 
without these assumptions, which is why using S(n) ≤ T(n) by itself as the 
induction hypothesis is not enough to make the proof work. 


E.2.2 Seating arrangements (20 points) 


A group of k students sit in a row of n seats. The students can choose 
whatever seats they wish, provided: (a) from left to right, they are seated in 
alphabetical order; and (b) each student has an empty seat immediately to 
his or her right. 

For example, with 3 students A, B, and C and 7 seats, there are exactly 4 
ways to seat the students: A-B-C--, A-B--C-, A--B-C-, and -A-B-C-. 

Give a formula that gives the number of ways to seat k students in n 
seats according to the rules given above. 


Solution 


The basic idea is that we can think of each student and the adjacent empty 
space as a single width-2 unit. Together, these units take up 2k seats, leaving 
n — 2k extra empty seats to distribute between the students. There are a 
couple of ways to count how to do this. 


Combinatorial approach 
Treat each of the k student-seat blocks and n − 2k extra seats as filling one 
of k + (n − 2k) = n − k slots. There are exactly $\binom{n-k}{k}$ ways to do this. 


Generating function approach 


Write $z + z^2$ for the choice between a width-1 extra seat and a width-2 student- 
seat block. For a row of n − k such things, we get the generating function 

\[
(z + z^2)^{n-k} = z^{n-k} (1 + z)^{n-k}
= z^{n-k} \sum_{i=0}^{n-k} \binom{n-k}{i} z^i
= \sum_{i=0}^{n-k} \binom{n-k}{i} z^{n-k+i}.
\]

The $z^n$ coefficient is obtained when i = k, giving $\binom{n-k}{k}$ ways to fill out 
exactly n seats. 
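Both approaches can be checked against a direct enumeration. The sketch below (function name invented here) tries every set of k seats, uses the fact that alphabetical order then forces who sits where, and keeps the choices in which each occupied seat has an empty seat immediately to its right.

```python
from itertools import combinations
from math import comb


def count_seatings(n: int, k: int) -> int:
    """Count seat choices where every student has an empty seat directly to the right."""
    count = 0
    for seats in combinations(range(n), k):
        occupied = set(seats)
        # Seat s + 1 must exist and must not be taken by another student.
        if all(s + 1 < n and s + 1 not in occupied for s in seats):
            count += 1
    return count


if __name__ == "__main__":
    print(count_seatings(7, 3), comb(7 - 3, 3))
```

For 3 students and 7 seats this gives the four arrangements from the example.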


E.2.3 Non-attacking rooks (20 points) 


Place n rooks at random on an n × n chessboard (i.e., an n × n grid), so 
that all $\binom{n^2}{n}$ placements are equally likely. What is the probability of the 
event that every row and every column of the chessboard contains exactly 
one rook? 


Solution 


We need to count how many placements of rooks there are that put exactly 
one rook per row and exactly one rook per column. Since we know that 
there is one rook per row, we can specify where these rooks go by choosing 
a unique column for each row. There are n choices for the first row, n − 1 
remaining for the second row, and so on, giving n(n − 1)···1 = n! choices 
altogether. So the probability of the event is $n! / \binom{n^2}{n} = (n!)^2 (n^2 - n)! / (n^2)!$. 
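For small boards the probability can be computed exactly by listing every placement of n rooks on distinct squares. This sketch assumes, as in the problem, that a placement is an unordered set of n squares; the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, factorial


def rook_probability(n: int) -> Fraction:
    """Exact probability that n random rooks occupy every row and column once."""
    squares = [(row, col) for row in range(n) for col in range(n)]
    placements = list(combinations(squares, n))
    good = sum(
        1
        for placement in placements
        if len({row for row, _ in placement}) == n
        and len({col for _, col in placement}) == n
    )
    return Fraction(good, len(placements))


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, rook_probability(n), Fraction(factorial(n), comb(n * n, n)))
```

For n = 2 there are 6 placements and 2 of them are non-attacking, so the probability is 1/3.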


E.2.4 Subsets (20 points) 

Let A ⊆ B. 


1. Prove or disprove: There exists an injection f : A → B. 


2. Prove or disprove: There exists a surjection g : B → A. 


Solution 

1. Proof: Let f(x) = x. Then f(x) = f(y) implies x = y and f is 
injective. 

2. Disproof: Let B be nonempty and let A = ∅. Then there is no function 
at all from B to A, surjective or not. 


E.3 Midterm Exam, October 24th, 2008 


Write your answers on the exam. Justify your answers. Work alone. Do not 
use any notes or books. 

There are four problems on this exam, each worth 20 points, for a total 
of 80 points. You have approximately 50 minutes to complete this exam. 


E.3.1 Some sums (20 points) 


Let $a_0, a_1, \ldots$ and $b_0, b_1, \ldots$ be sequences such that for i in N, $a_i \le b_i$. 
Let $A_i = \sum_{j=0}^{i} a_j$ and let $B_i = \sum_{j=0}^{i} b_j$. 
Prove or disprove: For all i in N, $A_i \le B_i$. 


Solution 


Proof: By induction on i. For i = 0 we have $A_0 = a_0 \le b_0 = B_0$. Now 
suppose $A_i \le B_i$. Then 

\[
A_{i+1} = \sum_{j=0}^{i+1} a_j = \sum_{j=0}^{i} a_j + a_{i+1} = A_i + a_{i+1}
\le B_i + b_{i+1} = \sum_{j=0}^{i} b_j + b_{i+1} = \sum_{j=0}^{i+1} b_j = B_{i+1}.
\]


E.3.2 Nested ranks (20 points) 


You are recruiting people for a secret organization, from a population of n 
possible recruits. Out of these n possible recruits, some subset M will be 
members. Out of this subset M, some further subset C will be members of 
the inner circle. Out of this subset C, some further subset X will be Exalted 
Grand High Maharajaraja Panjandrums of Indifference. It is possible that 
any or all of these sets will be empty. 

If the roster of your organization gives the members of the sets M, C, 
and X, and if (as usual) order doesn’t matter within the sets, how many 
different possible rosters can you have? 


Solution 


There is an easy way to solve this, and a hard way to solve this. 

Easy way: For each possible recruit x, we can assign x one of four states: 
non-member, member but not inner circle member, inner circle member but 
not EGHMPoI, or EGHMPoI. If we know the state of each possible recruit, 
that determines the contents of M, C, X and vice versa. It follows that 
there is a one-to-one mapping between these two representations, and that 
the number of rosters is equal to the number of assignments of states to all 
n potential recruits, which is $4^n$. 

Hard way: By repeated application of the binomial theorem. Expressing 
the selection process in terms of choosing nested subsets of m, c, and x 
members, the number of possible rosters is 

\[
\begin{aligned}
\sum_{m=0}^{n} \binom{n}{m} \sum_{c=0}^{m} \binom{m}{c} \sum_{x=0}^{c} \binom{c}{x}
&= \sum_{m=0}^{n} \binom{n}{m} \sum_{c=0}^{m} \binom{m}{c} 2^c \\
&= \sum_{m=0}^{n} \binom{n}{m} (1 + 2)^m \\
&= \sum_{m=0}^{n} \binom{n}{m} 3^m \\
&= (1 + 3)^n \\
&= 4^n.
\end{aligned}
\]
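The hard way translates almost literally into nested loops over subsets. The following sketch (with helper names invented for the illustration) enumerates every chain X ⊆ C ⊆ M ⊆ {0, ..., n − 1} and compares the count with $4^n$.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of the given collection as a tuple."""
    items = list(items)
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def count_rosters(n: int) -> int:
    """Count nested triples X within C within M within {0, ..., n-1}."""
    return sum(
        1
        for members in subsets(range(n))
        for inner_circle in subsets(members)
        for _exalted in subsets(inner_circle)
    )


if __name__ == "__main__":
    for n in range(6):
        print(n, count_rosters(n), 4**n)
```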


E.3.3 Nested sets (20 points) 

Let A, B, and C be sets. 
1. Prove or disprove: If A ∈ B and B ⊆ C, then A ⊆ C. 
2. Prove or disprove: If A ⊆ B and B ⊆ C, then A ⊆ C. 


Solution 


1. Disproof: Let A = {∅}, B = {A} = {{∅}}, and C = B. Then A ∈ B 
and B ⊆ C, but A ⊈ C, because ∅ ∈ A but ∅ ∉ C. 

2. Proof: Let x ∈ A. Then since A ⊆ B, we have x ∈ B, and since B ⊆ C, 
we have x ∈ C. It follows that every x in A is also in C, and that A is 
a subset of C. 


E.3.4 An efficient grading method (20 points) 


A test is graded on a scale of 0 to 80 points. Because the grading is 
completely random, your grade can be represented by a random variable X 
with 0 ≤ X ≤ 80 and E[X] = 60. 


1. What is the maximum possible probability that X = 80? 


2. Suppose that we change the bounds to 20 ≤ X ≤ 80, but E[X] is still 
60. Now what is the maximum possible probability that X = 80? 


Solution 


1. Here we apply Markov’s inequality: since X ≥ 0, we have 
$\Pr[X \ge 80] \le \frac{E[X]}{80} = \frac{60}{80} = 3/4$. This maximum is achieved exactly by letting 
X = 0 with probability 1/4 and 80 with probability 3/4, giving 
E[X] = (1/4) · 0 + (3/4) · 80 = 60. 


2. Raising the minimum grade to 20 knocks out the possibility of getting 0, 
so our previous distribution doesn’t work. In this new case we can apply 
Markov’s inequality to Y = X − 20 ≥ 0, to get 
$\Pr[X \ge 80] = \Pr[Y \ge 60] \le \frac{E[Y]}{60} = \frac{40}{60} = 2/3$. So the extreme case would seem to be that we 
get 20 with probability 1/3 and 80 with probability 2/3. It’s easy to 
check that we then get E[X] = (1/3) · 20 + (2/3) · 80 = 180/3 = 60. So 
in fact the best we can do now is a probability of 2/3 of getting 80, 
less than we had before. 


E.4 Midterm exam, October 21st, 2010 


Write your answers on the exam. Justify your answers. Work alone. Do not 
use any notes or books. 

There are four problems on this exam, each worth 20 points, for a total 
of 80 points. You have approximately 75 minutes to complete this exam. 


E.4.1 A partial order (20 points) 


Let S ⊆ N, and for any x, y ∈ N, define x ⪯ y if and only if there exists 
z ∈ S such that x + z = y. 

Show that if ⪯ is a partial order, then (a) 0 is in S and (b) for any x, y 
in S, x + y is in S. 


Solution 


If ⪯ is a partial order, then by reflexivity we have x ⪯ x for any x. But 
then there exists z ∈ S such that x + z = x, which can only happen if z = 0. 
Thus 0 ∈ S. 

Now suppose x and y are both in S. Then 0 + x = x implies 0 ⪯ x, and 
x + y = x + y implies x ⪯ x + y. Transitivity of ⪯ gives 0 ⪯ x + y, which 
occurs only if some z such that 0 + z = x + y is in S. The only such z is 
x + y, so x + y is in S. 


E.4.2 Big exponents (20 points) 
Let p be a prime, and let 0 ≤ a < p. Show that $a^{2p-1} \equiv a \pmod{p}$. 


Solution 


Write $a^{2p-1} = a^{p-1} a^{p-1} a$. If a ≠ 0, Euler’s Theorem (or Fermat’s Little 
Theorem) says $a^{p-1} \equiv 1 \pmod{p}$, so in this case $a^{p-1} a^{p-1} a \equiv a \pmod{p}$. If 
a = 0, then (since 2p − 1 ≠ 0), $a^{2p-1} = 0 \equiv a \pmod{p}$. 

E.4.3 At the playground (20 points) 


Let L(x, y) represent the statement “x likes y” and let T(x) represent the 
statement “x is tall,” where x and y range over a universe consisting of all 
children on a playground. Let m be “Mary,” one of the children. 


1. Translate the following statement into predicate logic: “If x is tall, 
then Mary likes x if and only if x does not like x.” 


2. Show that if the previous statement holds, Mary is not tall. 


Solution 


1. $\forall x \, (T(x) \Rightarrow (L(m, x) \Leftrightarrow \neg L(x, x)))$. 

2. Suppose the previous statement is true. Let x = m, then 
$T(m) \Rightarrow (L(m, m) \Leftrightarrow \neg L(m, m))$. But $L(m, m) \Leftrightarrow \neg L(m, m)$ is false, so T(m) 
must also be false. 


E.4.4 Gauss strikes back (20 points) 


Give a closed-form formula for $\sum_{k=a}^{b} k$, assuming 0 ≤ a ≤ b. 


Solution 


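Here are three ways to do this: 


1. Write $\sum_{k=a}^{b} k$ as $\sum_{k=1}^{b} k - \sum_{k=1}^{a-1} k$ and then use the formula 
   $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$ to get 

   \[
   \begin{aligned}
   \sum_{k=a}^{b} k &= \sum_{k=1}^{b} k - \sum_{k=1}^{a-1} k \\
   &= \frac{b(b+1)}{2} - \frac{(a-1)a}{2} \\
   &= \frac{b(b+1) - a(a-1)}{2}.
   \end{aligned}
   \]

2. Use Gauss’s trick, and compute 

   \[
   \begin{aligned}
   2 \sum_{k=a}^{b} k &= \sum_{k=a}^{b} k + \sum_{k=a}^{b} (b + a - k) \\
   &= \sum_{k=a}^{b} (b + a) \\
   &= (b - a + 1)(b + a).
   \end{aligned}
   \]

   Dividing both sides by 2 gives $\frac{(b-a+1)(b+a)}{2}$. 

3. Write $\sum_{k=a}^{b} k$ as $\sum_{k=0}^{b-a} (a + k) = (b - a + 1)a + \sum_{k=0}^{b-a} k$. Then use the 
   sum formula as before to turn this into $(b - a + 1)a + \frac{(b-a)(b-a+1)}{2}$. 


Though these solutions appear different, all of them can be expanded to 
$\frac{b^2 - a^2 + a + b}{2}$. 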
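The three expressions, and the common expanded form $(b^2 - a^2 + a + b)/2$, can be compared against a direct summation over a grid of small values. A minimal sketch, with function names chosen here:

```python
def by_difference(a: int, b: int) -> int:
    """Method 1: difference of two triangular numbers."""
    return (b * (b + 1) - a * (a - 1)) // 2


def by_gauss(a: int, b: int) -> int:
    """Method 2: pair each term k with b + a - k."""
    return (b - a + 1) * (b + a) // 2


def by_shift(a: int, b: int) -> int:
    """Method 3: shift the index so the sum starts at 0."""
    return (b - a + 1) * a + (b - a) * (b - a + 1) // 2


def expanded(a: int, b: int) -> int:
    """Common expanded form (b^2 - a^2 + a + b) / 2."""
    return (b * b - a * a + a + b) // 2


if __name__ == "__main__":
    a, b = 3, 10
    print(sum(range(a, b + 1)), by_difference(a, b), by_gauss(a, b),
          by_shift(a, b), expanded(a, b))
```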


Appendix F 


Final exams from earlier 
semesters 


Note that topics may vary from semester to semester, so the appearance 
of a particular topic on one of these exams does not necessarily indicate 
that it will appear on the second exam for the current semester. Note also 
that these exams were designed for a longer time slot—and were weighted 
higher—than the current semester’s exams; the current semester’s exams are 
likely to be substantially shorter. 


F.1 CS202 Final Exam, December 15th, 2004 


Write your answers in the blue book(s). Justify your answers. Work alone. 
Do not use any notes or books. 

There are seven problems on this exam, each worth 20 points, for a total 
of 140 points. You have approximately three hours to complete this exam. 


F.1.1 A multiplicative game (20 points) 


Consider the following game: A player starts with a score of 0. On each turn, 
the player rolls two dice, each of which is equally likely to come up 1, 2, 3, 
4, 5, or 6. They then take the product xy of the two numbers on the dice. 
If the product is greater than 20, the game ends. Otherwise, they add the 
product to their score and take a new turn. The player’s score at the end of 
the game is thus the sum of the products of the dice for all turns before the 
first turn on which they get a product greater than 20. 


1. What is the probability that the player’s score at the end of the game 
is zero? 


2. What is the expectation of the player’s score at the end of the game? 


Solution 


1. The only way to get a score of zero is to lose on the first roll. There 
are 36 equally probable outcomes for the first roll, and of these the 
six outcomes (4,6), (5,5), (5,6), (6,4), (6,5), and (6,6) yield a product 
greater than 20. So the probability of getting zero is 6/36 = 1/6. 


2. To compute the total expected score, let us first compute the expected 
score for a single turn. This is 

   \[
   \frac{1}{36} \sum_{i=1}^{6} \sum_{j=1}^{6} ij \, [ij \le 20],
   \]

   where $[ij \le 20]$ is the indicator random variable for the event that 
   $ij \le 20$. 

   I don’t know of a really clean way to evaluate the sum, but we can 
   expand it as 

   \[
   \begin{aligned}
   \left(\sum_{i=1}^{3} i\right) \left(\sum_{j=1}^{6} j\right)
   + 4 \sum_{j=1}^{5} j + 5 \sum_{j=1}^{4} j + 6 \sum_{j=1}^{3} j
   &= 6 \cdot 21 + 4 \cdot 15 + 5 \cdot 10 + 6 \cdot 6 \\
   &= 126 + 60 + 50 + 36 \\
   &= 272.
   \end{aligned}
   \]

   So the expected score per turn is 272/36 = 68/9. 

   Now we need to calculate the expected total score; call this value S. 
   Assuming we continue after the first turn, the expected total score for 
   the second and subsequent turns is also S, since the structure of the 
   tail of the game is identical to the game as a whole. So we have 

   \[
   S = 68/9 + (5/6)S,
   \]

   which we can solve to get S = (6 · 68)/9 = 136/3. 
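Since the sum has no especially clean evaluation, it is a natural thing to hand to a machine. This sketch recomputes the per-turn expectation, the probability of continuing, and the solution of S = (per-turn) + (continue) · S using exact fractions; the variable names are chosen here for illustration.

```python
from fractions import Fraction

rolls = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# Expected score gained in one turn: products above 20 contribute nothing.
per_turn = sum(Fraction(i * j, 36) for i, j in rolls if i * j <= 20)

# Probability that the game continues after a turn.
p_continue = Fraction(sum(1 for i, j in rolls if i * j <= 20), 36)

# Solve S = per_turn + p_continue * S for S.
expected_total = per_turn / (1 - p_continue)

if __name__ == "__main__":
    print(per_turn, p_continue, expected_total)
```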


F.1.2 An equivalence in space (20 points) 


Let V be a k-dimensional vector space over the real numbers $\mathbb{R}$ with a 
standard basis $\vec{x}_i$. Recall that any vector $\vec{z}$ in V can be represented uniquely 
as $\sum_{i=1}^{k} z_i \vec{x}_i$. Let $f : V \to \mathbb{R}$ be defined by $f(\vec{z}) = \sum_{i=1}^{k} |z_i|$, where the $z_i$ 
are the coefficients of $\vec{z}$ in the standard representation. Define a relation 
∼ on V × V by $\vec{z}_1 \sim \vec{z}_2$ if and only if $f(\vec{z}_1) = f(\vec{z}_2)$. Show that ∼ is an 
equivalence relation, i.e., that it is reflexive, symmetric, and transitive. 


Solution 


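Both the structure of the vector space and the definition of f are irrelevant; 
the only fact we need is that $\vec{z}_1 \sim \vec{z}_2$ if and only if $f(\vec{z}_1) = f(\vec{z}_2)$. Thus for 
all $\vec{x}$, $\vec{x} \sim \vec{x}$ since $f(\vec{x}) = f(\vec{x})$ (reflexivity); for all $\vec{x}$ and $\vec{y}$, if $\vec{x} \sim \vec{y}$, then 
$f(\vec{x}) = f(\vec{y})$ implies $f(\vec{y}) = f(\vec{x})$ implies $\vec{y} \sim \vec{x}$ (symmetry); and for all $\vec{x}$, $\vec{y}$, 
and $\vec{z}$, if $\vec{x} \sim \vec{y}$ and $\vec{y} \sim \vec{z}$, then $f(\vec{x}) = f(\vec{y})$ and $f(\vec{y}) = f(\vec{z})$, so $f(\vec{x}) = f(\vec{z})$ 
and $\vec{x} \sim \vec{z}$ (transitivity). 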


F.1.3 A very big fraction (20 points) 


Use the fact that $p = 2^{24036583} - 1$ is prime to show that 

\[
\frac{9^{2^{24036582}} - 9}{2^{24036583} - 1}
\]

is an integer. 


Solution 


Let’s save ourselves a lot of writing by letting x = 24036583, so that $p = 2^x - 1$ 
and the fraction becomes 

\[
\frac{9^{2^{x-1}} - 9}{p}.
\]

To show that this is an integer, we need to show that p divides the 
numerator, i.e., that 

\[
9^{2^{x-1}} - 9 \equiv 0 \pmod{p}.
\]

We’d like to attack this with Fermat’s Little Theorem, so we need to get 
the exponent to look something like $p - 1 = 2^x - 2$. Observe that $9 = 3^2$, so 

\[
9^{2^{x-1}} = (3^2)^{2^{x-1}} = 3^{2^x} = 3^{2^x - 2} \cdot 3^2 = 3^{p-1} \cdot 3^2.
\]

But $3^{p-1} \equiv 1 \pmod{p}$, so we get $9^{2^{x-1}} \equiv 3^2 = 9 \pmod{p}$, and thus 
$9^{2^{x-1}} - 9 \equiv 0 \pmod{p}$ as desired. 
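The exponent in the exam problem is far too large to experiment with casually, but the same argument applies to any Mersenne prime $p = 2^x - 1$ other than 3, and small cases are easy to test with modular exponentiation. The sketch below uses Python's built-in three-argument `pow`; the list of exponents is a hand-picked set of known Mersenne-prime exponents, included only for illustration.

```python
def numerator_divisible(x: int) -> bool:
    """Check whether p = 2^x - 1 divides 9^(2^(x-1)) - 9, working modulo p throughout."""
    p = 2**x - 1
    return (pow(9, 2 ** (x - 1), p) - 9) % p == 0


# Exponents x for which 2^x - 1 is a (well-known) Mersenne prime greater than 3.
MERSENNE_EXPONENTS = [3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127]

if __name__ == "__main__":
    for x in MERSENNE_EXPONENTS:
        print(x, numerator_divisible(x))
```

Working modulo p keeps every intermediate value below p squared, so even x = 127 needs only about a hundred squarings.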


F.1.4 A pair of odd vertices (20 points) 


Let G be a simple undirected graph (i.e., one with no self-loops or parallel 
edges), and let u be a vertex in G with odd degree. Show that there is 
another vertex v ≠ u in G such that (a) v also has odd degree, and (b) there
is a path from u to v in G.


Solution 


Let G’ be the connected component of u in G. Then G’ is itself a graph, and 
the degree of any vertex is the same in G’ as in G. Since the sum of all the 
degrees of vertices in G’ must be even by the Handshaking Lemma, there 
cannot be an odd number of odd-degree vertices in G’, and so there is some 
v in G’ not equal to u that also has odd degree. Since G’ is connected, there 
exists a path from u to v.


F.1.5 How many magmas? (20 points) 


Recall that a magma is an algebra consisting of a set of elements and one 
binary operation, which is not required to satisfy any constraints whatsoever 
except closure. Consider a set S of n elements. How many distinct magmas 
are there that have S as their set of elements?


Solution 


Since the carrier is fixed, we have to count the number of different ways of 
defining the binary operation. Let’s call the operation f. For each ordered 
pair of elements (x, y) ∈ S × S, we can pick any element z ∈ S for the value
of f(x, y). This gives n choices for each of the $n^2$ pairs, which gives $n^{n^2}$
magmas on S.
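
For very small n the count $n^{n^2}$ can be confirmed by brute force. The following is a sketch only (the helper name count_magmas is made up for illustration): it enumerates every possible operation table on {0, ..., n-1} and counts them.

```python
from itertools import product


def count_magmas(n: int) -> int:
    """Count binary operations on {0, ..., n-1} by listing every operation table."""
    pairs = list(product(range(n), repeat=2))        # the n^2 ordered pairs (x, y)
    tables = product(range(n), repeat=len(pairs))    # one output value per pair
    return sum(1 for _ in tables)


for n in (1, 2, 3):
    print(n, count_magmas(n), n ** (n * n))
```

The enumeration grows far too quickly to go much beyond n = 3, which is exactly what the formula predicts.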


F.1.6 A powerful relationship (20 points) 


Recall that the powerset P(S) of a set S is the set of sets {A : A ⊆ S}.
Prove that if S ⊆ T, then P(S) ⊆ P(T).


Solution 


Let A ∈ P(S); then by the definition of P(S) we have A ⊆ S. But then
A ⊆ S ⊆ T implies A ⊆ T, and so A ∈ P(T). Since A was arbitrary,
A ∈ P(T) holds for all A in P(S), and we have P(S) ⊆ P(T).


F.1.7 A group of archaeologists (20 points)


Archaeologists working deep in the Upper Nile Valley have discovered a 
curious machine, consisting of a large box with three levers painted red, yellow, 
and blue. Atop the box is a display that shows one of a set of n hieroglyphs.
Each lever can be pushed up or down, and pushing a lever changes the
displayed hieroglyph to some other hieroglyph. The archaeologists have
determined by extensive experimentation that for each hieroglyph x, pushing
the red lever up when x is displayed always changes the display to the same
hieroglyph f(x), and pushing the red lever down always changes hieroglyph
f(x) to x. A similar property holds for the yellow and blue levers: pushing
yellow up sends x to g(x) and down sends g(x) to x; and pushing blue up
sends x to h(x) and down sends h(x) to x.

Prove that there is a finite number k such that no matter which hieroglyph
is displayed initially, pushing any one of the levers up k times leaves the 
display with the same hieroglyph at the end. 

Clarification added during exam: k > 0. 


Solution 


Let H be the set of hieroglyphs, and observe that the map f : H → H
corresponding to pushing the red lever up is invertible and thus a permutation.
Similarly, the maps g and h corresponding to yellow or blue up-pushes are
also permutations, as are the inverses $f^{-1}$, $g^{-1}$, and $h^{-1}$ corresponding to
red, yellow, or blue down-pushes. Repeated pushes of one or more levers
correspond to compositions of permutations, so the set of all permutations
obtained by sequences of zero or more pushes is the subgroup G of the
permutation group $S_{|H|}$ generated by f, g, and h.

Now consider the cyclic subgroup ⟨f⟩ of G generated by f alone. Since
G is finite, there is some positive index m such that $f^m = e$. Similarly there are
positive indices n and p such that $g^n = e$ and $h^p = e$. So pushing the red lever up
any multiple of m times restores the initial state, as does pushing the yellow
lever up any multiple of n times or the blue lever up any multiple of p times.
Let k = mnp. Then k is a multiple of m, n, and p, and pushing any single
lever up k times leaves the display in the same state.


F.2 CS202 Final Exam, December 16th, 2005 


Write your answers in the blue book(s). Justify your answers. Work alone. 
Do not use any notes or books. 

There are six problems on this exam, each worth 20 points, for a total of
120 points. You have approximately three hours to complete this exam. 


F.2.1 Order (20 points) 


Recall that the order of an element x of a group is the least positive integer
k such that $x^k = e$, where e is the identity, or ∞ if no such k exists.

Prove or disprove: In the symmetric group $S_n$ of permutations on n
elements, the order of any permutation is at most $\binom{n}{2}$.


Clarifications added during exam 


• Assume n > 2.


Solution 


Disproof: Consider the permutation (1 2)(3 4 5)(6 7 8 9 10)(11 12 13 14 15
16 17) in $S_{17}$. This has order 2 · 3 · 5 · 7 = 210 but $\binom{17}{2} = \frac{17 \cdot 16}{2} = 136$.
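
The arithmetic in the disproof is easy to double-check. The sketch below uses the standard fact that the order of a permutation is the least common multiple of its disjoint cycle lengths; the function name permutation_order is an illustrative choice, and math.lcm with several arguments assumes Python 3.9 or later.

```python
from math import comb, lcm


def permutation_order(cycle_lengths):
    """Order of a permutation whose disjoint cycles have the given lengths."""
    return lcm(*cycle_lengths)


# Cycle lengths of (1 2)(3 4 5)(6 7 8 9 10)(11 12 13 14 15 16 17).
cycles = [2, 3, 5, 7]
n = sum(cycles)  # the cycles use up all 17 elements
print(n, permutation_order(cycles), comb(n, 2))
```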


F.2.2 Count the subgroups (20 points) 


Recall that the free group over a singleton set {a} consists of all words of the
form $a^k$, where k is an integer, with multiplication defined by $a^k a^m = a^{k+m}$.
Prove or disprove: The free group over {a} has exactly one finite subgroup.


Solution 


Proof: Let F be the free group defined above and let S be a subgroup of F.
Suppose S contains $a^k$ for some k ≠ 0. Then S contains $a^{2k}, a^{3k}, \ldots$ because
it is closed under multiplication. Since these elements are all distinct, S is
infinite.

The alternative is that S does not contain $a^k$ for any k ≠ 0; this leaves
only $a^0$ as a possible element of S, and there is only one such subgroup: the
trivial subgroup $\{a^0\}$.


F.2.3 Two exits (20 points) 


Let G = (V, E) be a nonempty connected undirected graph with no self-loops
or parallel edges, in which every vertex has degree 4. Prove or disprove: For
any partition of the vertices V into two nonempty non-overlapping subsets
S and T, there are at least two edges that have one endpoint in S and one
in T.


Solution


Proof: Because G is connected and every vertex has even degree, there is 
an Euler tour of the graph (a cycle that uses every edge exactly once). Fix 
some particular tour and consider a partition of V into two sets S and T.
There must be at least one edge between S and T, or G is not connected;
but if there is only one, then the tour can’t return to S or T once it leaves. 
It follows that there are at least 2 edges between S and T as claimed. 


F.2.4 Victory (20 points) 


A sabermetrician wishes to test the hypothesis that a set of n baseball teams 
are strictly ranked, so that no two teams have the same rank and if some
team A has a higher rank than some team B, A will always beat B in a 
7-game series. To test this hypothesis, the sabermetrician has each team 
play a 7-game series against each other team. 

Suppose that the teams are in fact all equally incompetent and that the 
winner of each series is chosen by an independent fair coin-flip. What is the 
probability that the results will nonetheless be consistent with some strict 
ranking? 


Solution 


Each ranking is a total order on the n teams, and we can describe such 
a ranking by giving one of the n! permutations of the teams. These in 
turn generate n! distinct outcomes of the experiment that will cause the 
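sabermetrician to believe the hypothesis. To compute the probability that one
of these outcomes occurs, we must divide by the total number of outcomes,
which is $2^{\binom{n}{2}}$ because each of the $\binom{n}{2}$ series has two possible winners,
giving

\[ \Pr[\text{strict ranking}] = \frac{n!}{2^{\binom{n}{2}}}. \]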
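
For a handful of teams, the value $n!/2^{\binom{n}{2}}$ can be checked by exhaustive enumeration. The sketch below is an illustration only (the names consistent_with_ranking and prob_consistent are made up here): it tries every assignment of series winners and tests each against every possible ranking.

```python
from fractions import Fraction
from itertools import combinations, permutations, product
from math import comb, factorial


def consistent_with_ranking(n, winner):
    """True if some strict ranking predicts the winner of every series."""
    for order in permutations(range(n)):            # order[0] is the top team
        rank = {team: r for r, team in enumerate(order)}
        # For the pair (i, j) with winner w, the loser is i + j - w.
        if all(rank[w] < rank[i + j - w] for (i, j), w in winner.items()):
            return True
    return False


def prob_consistent(n):
    """Exact probability that fair coin-flip results fit some strict ranking."""
    pairs = list(combinations(range(n), 2))
    good = 0
    for flips in product((0, 1), repeat=len(pairs)):
        winner = {pair: pair[flip] for pair, flip in zip(pairs, flips)}
        good += consistent_with_ranking(n, winner)
    return Fraction(good, 2 ** len(pairs))


for n in range(1, 5):
    print(n, prob_consistent(n), Fraction(factorial(n), 2 ** comb(n, 2)))
```

The brute force is only practical for tiny n, since the number of outcomes doubles with every additional series.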


F.2.5 An aggressive aquarium (20 points) 


A large number of juvenile piranha, weighing 1 unit each, are placed in an 
aquarium. Each day, each piranha attempts to eat one other piranha. If 
successful, the eater increases its weight to the sum of its previous weight 
and the weight of its meal (and the eaten piranha is gone); if unsuccessful, 
the piranha remains at the same weight. 

Prove that after k days, no surviving piranha weighs more than $2^k$ units.


Clarifications added during exam


• It is not possible for a piranha to eat and be eaten on the same day.


Solution 


By induction on k. The base case is k = 0, when all piranha weigh exactly
$2^0 = 1$ unit. Suppose some piranha has weight $x \le 2^k$ after k days. Then
either its weight stays the same, or it successfully eats another piranha of
weight $y \le 2^k$ and increases its weight to $x + y \le 2^k + 2^k = 2^{k+1}$. In either case
the claim follows for k + 1.


F.2.6 A subspace of matrices (20 points) 


Recall that a subspace of a vector space is a set that is closed under vector 
addition and scalar multiplication. Recall further that the subspace generated 
by a set of vector space elements is the smallest such subspace, and its 
dimension is the size of any basis of the subspace. 
Let A be the 2-by-2 matrix

\[ \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \]

over the reals, and consider the subspace S of the vector space of 2-by-2
real matrices generated by the set $\{A, A^2, A^3, \ldots\}$. What is the dimension
of S?


Solution 


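First let’s see what $A^k$ looks like. We have

\[ A^2 = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix}, \]

\[ A^3 = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 3 \\ 0 & 1 \end{pmatrix}, \]

and in general we can show by induction that

\[ A^k = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & k-1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix}. \]

Observe now that for any k,

\[ A^k = \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix} = (k-1) \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix} - (k-2) \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} = (k-1)A^2 - (k-2)A. \]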


It follows that $\{A, A^2\}$ generates all the $A^k$ and thus generates any linear
combination of the $A^k$ as well. It is easy to see that A and $A^2$ are linearly
independent: if $c_1 A + c_2 A^2 = 0$, we must have (a) $c_1 + c_2 = 0$ (to cancel
out the diagonal entries) and (b) $c_1 + 2c_2 = 0$ (to cancel out the nonzero
off-diagonal entry). The only solution to both equations is $c_1 = c_2 = 0$.

Because $\{A, A^2\}$ is a linearly independent set that generates S, it is a
basis, and S has dimension 2. 
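
The identity $A^k = (k-1)A^2 - (k-2)A$ is easy to confirm mechanically for small k. This is only a sketch with hand-rolled 2-by-2 helpers (matmul, power, and combo are illustrative names, not anything from the course):

```python
def matmul(X, Y):
    """Product of two 2-by-2 matrices given as nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 1], [0, 1]]
A2 = matmul(A, A)


def power(k):
    """A**k for k >= 1, by repeated multiplication."""
    P = A
    for _ in range(k - 1):
        P = matmul(P, A)
    return P


def combo(k):
    """(k-1) A^2 - (k-2) A, computed entry by entry."""
    return [[(k - 1) * A2[i][j] - (k - 2) * A[i][j] for j in range(2)]
            for i in range(2)]


for k in range(1, 6):
    print(k, power(k), combo(k))
```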


F.3 CS202 Final Exam, December 20th, 2007


Write your answers in the blue book(s). Justify your answers. Work alone. 
Do not use any notes or books. 

There are six problems on this exam, each worth 20 points, for a total of 
120 points. You have approximately three hours to complete this exam. 


F.3.1 A coin-flipping problem (20 points) 


A particularly thick and lopsided coin comes up heads with probability
$p_H$, tails with probability $p_T$, and lands on its side with probability
$p_S = 1 - (p_H + p_T)$. Suppose you flip the coin repeatedly. What is the probability
that it comes up heads twice in a row at least once before the first time it 
comes up tails? 


Solution 


Let p be the probability of the event W that the coin comes up heads twice
in a row before coming up tails. Consider the following mutually-exclusive
events for the first one or two coin-flips:

Event A    Pr[A]        Pr[W|A]
HH         $p_H^2$      1
HT         $p_H p_T$    0
HS         $p_H p_S$    p
T          $p_T$        0
S          $p_S$        p

Summing over all cases gives

\[ p = p_H^2 + p_H p_S p + p_S p, \]

which we can solve for p to get

\[ p = \frac{p_H^2}{1 - p_H p_S - p_S} = \frac{p_H^2}{p_H + p_T - p_H p_S} = \frac{p_H^2}{p_T + p_H(p_H + p_T)} = \frac{p_H^2}{p_T + p_H p_T + p_H^2}. \]


(Any of these is an acceptable answer.) 
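
As a cross-check on the closed form $p_H^2 / (1 - p_H p_S - p_S)$, we can follow the flips one at a time instead of solving for p. The sketch below is an illustration under made-up names (closed_form, by_iteration) and an arbitrary step count; it tracks the probability of being in each undecided state and accumulates the chance of winning.

```python
def closed_form(pH, pT):
    """p_H^2 / (1 - p_H p_S - p_S), with p_S = 1 - (p_H + p_T)."""
    pS = 1 - (pH + pT)
    return pH**2 / (1 - pH * pS - pS)


def by_iteration(pH, pT, steps=2000):
    """Approximate Pr[W] by stepping through the flips one at a time."""
    pS = 1 - (pH + pT)
    # fresh: still undecided and the previous flip was not a head.
    # after_head: still undecided and the previous flip was a head.
    fresh, after_head, won = 1.0, 0.0, 0.0
    for _ in range(steps):
        won += after_head * pH                      # second head in a row
        fresh, after_head = (fresh + after_head) * pS, fresh * pH
    return won


print(closed_form(0.3, 0.2), by_iteration(0.3, 0.2))
```

The two numbers should agree to within floating-point error whenever $p_T > 0$, since the undecided probability mass then shrinks geometrically.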


F.3.2 An ordered group (20 points)


Let G be a group and ≤ a partial order on the elements of G such that for
all x, y in G, x ≤ xy. How many elements does G have?


Solution 


The group G has exactly one element. 

First observe that G has at least one element, because it contains an 
identity element e. 

Now let x and y be any two elements of G. We can show x ≤ y, because
$y = x(x^{-1}y)$. Similarly, $y \le x = y(y^{-1}x)$. But then x = y by antisymmetry.
It follows that all elements of G are equal, i.e., that G has at most one 
element. 


F.3.3 Weighty vectors (20 points) 


Let the weight w(x) of an n × 1 column vector x be the number of nonzero
elements of x. Call an n × n matrix A near-diagonal if it has at most one
nonzero off-diagonal element; i.e., if there is at most one pair of indices i, j
such that i ≠ j and $A_{ij} \ne 0$.

Given n, what is the smallest value k such that there exists an n × 1
column vector x with w(x) = 1 and a sequence of k n × n near-diagonal
matrices $A_1, A_2, \ldots, A_k$ such that $w(A_1 A_2 \cdots A_k x) = n$?


Solution 


Let’s look at the effect of multiplying a vector of known weight by just one
near-diagonal matrix. We will show: (a) for any near-diagonal A and any x,
w(Ax) ≤ w(x) + 1, and (b) for any n × 1 column vector x with 0 < w(x) < n,
there exists a near-diagonal matrix A with w(Ax) ≥ w(x) + 1.

To prove (a), observe that $(Ax)_i = \sum_{j=1}^{n} A_{ij} x_j$. For $(Ax)_i$ to be nonzero,
there must be some index j such that $A_{ij} x_j$ is nonzero. This can occur in
two ways: j = i, and $A_{ii}$ and $x_i$ are both nonzero, or j ≠ i, and $A_{ij}$ and $x_j$
are both nonzero. The first case can occur for at most w(x) different values
of i (because there are only w(x) nonzero entries $x_i$). The second can occur
for at most one value of i (because there is at most one nonzero entry $A_{ij}$
with i ≠ j). It follows that Ax has at most w(x) + 1 nonzero entries, i.e.,
that w(Ax) ≤ w(x) + 1.

To prove (b), choose k and m such that $x_k = 0$ and $x_m \ne 0$, and let A
be the matrix with $A_{ii} = 1$ for all i, $A_{km} = 1$, and all other entries equal to
zero. Now consider $(Ax)_i$. If i ≠ k, then $(Ax)_i = \sum_{j=1}^{n} A_{ij} x_j = A_{ii} x_i = x_i$.
If i = k, then $(Ax)_k = \sum_{j=1}^{n} A_{kj} x_j = A_{kk} x_k + A_{km} x_m = x_m \ne 0$, since we
chose k so that $x_k = 0$ and chose m so that $x_m \ne 0$. So $(Ax)_i$ is nonzero if
either $x_i$ is nonzero or i = k, giving w(Ax) ≥ w(x) + 1.

Now proceed by induction: 

For any k, if $A_1, \ldots, A_k$ are near-diagonal matrices, then
$w(A_1 \cdots A_k x) \le w(x) + k$. Proof: The base case of k = 0 is trivial. For larger k,
$w(A_1 \cdots A_k x) = w(A_1 (A_2 \cdots A_k x)) \le w(A_2 \cdots A_k x) + 1 \le w(x) + (k - 1) + 1 = w(x) + k$.

Fix x with w(x) = 1. Then for any k < n, there exists a sequence of
near-diagonal matrices $A_1, \ldots, A_k$ such that $w(A_1 \cdots A_k x) = k + 1$. Proof: Again
the base case of k = 0 is trivial. For larger k < n, we have from the induction
hypothesis that there exists a sequence of k − 1 near-diagonal matrices
$A_2, \ldots, A_k$ such that $w(A_2 \cdots A_k x) = k < n$. From claim (b) above, together
with the upper bound from claim (a), we then get that there exists a
near-diagonal matrix $A_1$ such that $w(A_1 (A_2 \cdots A_k x)) = w(A_2 \cdots A_k x) + 1 = k + 1$.

Applying both these facts, setting k = n − 1 is necessary and sufficient
for $w(A_1 \cdots A_k x) = n$, and so k = n − 1 is the smallest value of k for which
this works.


F.3.4 A dialectical problem (20 points)


Let S be a set with n elements. Recall that a relation R is symmetric if xRy
implies yRx, antisymmetric if xRy and yRx implies x = y, reflexive if xRx
for all x, and irreflexive if ¬(xRx) for all x.


1. How many relations on S are symmetric, antisymmetric, and reflexive?

2. How many relations on S are symmetric, antisymmetric, and irreflexive?

3. How many relations on S are symmetric and antisymmetric?


Solution 


Since in all three cases we are considering symmetric antisymmetric relations,
we observe first that if R is such a relation, then xRy implies yRx which in
turn implies x = y. So any such R can have xRy only if x = y.

1. Let R be symmetric, antisymmetric, and reflexive. We have already
established that xRy implies x = y. Reflexivity says x = y implies
xRy, so we have xRy iff x = y. Since this fully determines R, there is
exactly 1 such relation.


2. Now let R be symmetric, antisymmetric, and irreflexive. For x ≠ y
we have ¬(xRy) (from symmetry+antisymmetry); but for x = y, we
again have ¬(xRy) (from irreflexivity). So R is the empty relation,
and again there is exactly 1 such relation.


3. Now for each x there is no constraint on whether xRx holds or not,
but we still have ¬(xRy) for x ≠ y. Since we can choose whether xRx
holds independently for each x, we have n binary choices giving $2^n$
possible relations.
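
All three counts (1, 1, and $2^n$) can be confirmed by brute force for small n. The sketch below is illustrative only; count_relations and the dictionary keys are made-up names. It enumerates every relation on {0, ..., n-1} as a set of ordered pairs and tests the defining properties directly.

```python
from itertools import product


def count_relations(n):
    """Count symmetric antisymmetric relations on {0, ..., n-1} by type."""
    elems = range(n)
    pairs = list(product(elems, repeat=2))
    counts = {"reflexive": 0, "irreflexive": 0, "any": 0}
    for bits in product((False, True), repeat=len(pairs)):
        R = {pair for pair, bit in zip(pairs, bits) if bit}
        symmetric = all((y, x) in R for (x, y) in R)
        antisymmetric = all(x == y for (x, y) in R if (y, x) in R)
        if not (symmetric and antisymmetric):
            continue
        counts["any"] += 1
        if all((x, x) in R for x in elems):
            counts["reflexive"] += 1
        if all((x, x) not in R for x in elems):
            counts["irreflexive"] += 1
    return counts


for n in (1, 2, 3):
    print(n, count_relations(n))
```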


F.3.5 A predictable pseudorandom generator (20 points) 


Suppose you are given a pseudorandom number generator that generates a
sequence of values $x_0, x_1, x_2, \ldots$ by the rule $x_{i+1} = (a x_i + b) \bmod p$, where
p is a prime and a, b, and $x_0$ are arbitrary integers in the range 0 … p − 1.
Suppose further that you know the value of p but that a, b, and $x_0$ are secret.


1. Prove that given any three consecutive values $x_i, x_{i+1}, x_{i+2}$, it is possible
to compute both a and b, provided $x_i \ne x_{i+1}$.


2. Prove that given only two consecutive values $x_i$ and $x_{i+1}$, it is impossible
to determine a.


Solution


1. We have two equations in two unknowns:

\[ a x_i + b \equiv x_{i+1} \pmod{p} \]
\[ a x_{i+1} + b \equiv x_{i+2} \pmod{p}. \]

Subtracting the second from the first gives

\[ a(x_i - x_{i+1}) \equiv x_{i+1} - x_{i+2} \pmod{p}. \]

If $x_i \ne x_{i+1}$, then we can multiply both sides by $(x_i - x_{i+1})^{-1}$ to get

\[ a \equiv (x_{i+1} - x_{i+2})(x_i - x_{i+1})^{-1} \pmod{p}. \]

Now we have a. To find b, plug our value for a into either equation
and solve for b.

2. We will show that for any observed values of $x_i$ and $x_{i+1}$, there are at
least two different values for a that are consistent with our observation;
in fact, we’ll show the even stronger fact that for any value of a, $x_i$
and $x_{i+1}$ are consistent with that choice of a. Proof: Fix a, and let
$b = x_{i+1} - a x_i \pmod{p}$. Then $x_{i+1} = a x_i + b \pmod{p}$.
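
The recovery procedure from part 1 translates directly into code. The sketch below is an illustration, not a reference implementation: recover_parameters is a made-up name, the secret values are hypothetical test data, and computing the inverse with pow(diff, -1, p) assumes Python 3.8 or later.

```python
def recover_parameters(x0, x1, x2, p):
    """Recover (a, b) from three consecutive outputs x0, x1, x2 modulo a prime p."""
    diff = (x0 - x1) % p
    if diff == 0:
        raise ValueError("need x0 != x1 (mod p) to invert the difference")
    a = (x1 - x2) * pow(diff, -1, p) % p   # a = (x1 - x2) (x0 - x1)^(-1) mod p
    b = (x1 - a * x0) % p                  # plug a back into the first equation
    return a, b


# Hypothetical secret parameters, used only to produce test data.
p, a, b, x0 = 101, 37, 58, 5
x1 = (a * x0 + b) % p
x2 = (a * x1 + b) % p
print(recover_parameters(x0, x1, x2, p))
```

Part 2 shows up in the same code: with only x0 and x1 available, the line computing b succeeds for every candidate a, so nothing pins a down.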


F.3.6 At the robot factory (20 points) 


Each robot built by Rossum’s Combinatorial Robots consists of a head and 
a body, each weighing a non-negative integer number of units. If there are 
exactly $3^n$ different ways to build a robot with total weight n, and exactly $2^n$
different bodies with weight n, exactly how many different heads are there
with weight n? 


Solution 


This is a job for generating functions! 

Let $R = \sum 3^n z^n = \frac{1}{1-3z}$ be the generating function for the number of
robots of each weight, and let $B = \sum 2^n z^n = \frac{1}{1-2z}$ be the generating function
for the number of bodies of each weight. Let $H = \sum h_n z^n$ be the generating
function for the number of heads. Then we have R = BH, or

\[ H = \frac{R}{B} = \frac{1-2z}{1-3z} = \frac{1}{1-3z} - \frac{2z}{1-3z}. \]

So $h_0 = 3^0 = 1$, and for n > 0, we have
$h_n = 3^n - 2 \cdot 3^{n-1} = (3-2)3^{n-1} = 3^{n-1}$.
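
The answer $h_0 = 1$, $h_n = 3^{n-1}$ for n > 0 can be checked against the problem statement by multiplying the series back out: the number of robots of weight n is the convolution of the head counts with the body counts $2^k$. A small sketch (the function names heads and robots are illustrative):

```python
def heads(n):
    """Claimed number of heads of weight n."""
    return 1 if n == 0 else 3 ** (n - 1)


def robots(n):
    """Number of (head, body) pairs of total weight n, with 2**k bodies of weight k."""
    return sum(heads(i) * 2 ** (n - i) for i in range(n + 1))


for n in range(6):
    print(n, robots(n), 3 ** n)
```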


F.4 CS202 Final Exam, December 19th, 2008 


Write your answers in the blue book(s). Justify your answers. Work alone. 
Do not use any notes or books. 

There are five problems on this exam, each worth 20 points, for a total 
of 100 points. You have approximately three hours to complete this exam. 


F.4.1 Some logical sets (20 points) 


Let A, B, and C be sets. 
Prove or disprove: If, for all x, x ∈ A → (x ∈ B → x ∈ C), then
A ∩ B ⊆ C.


Solution


Proof: Rewrite x ∈ A → (x ∈ B → x ∈ C) as x ∉ A ∨ (x ∉ B ∨ x ∈ C) or
(x ∉ A ∨ x ∉ B) ∨ x ∈ C. Applying De Morgan’s law we can convert the
first OR into an AND to get ¬(x ∈ A ∧ x ∈ B) ∨ x ∈ C. This can further be
rewritten as (x ∈ A ∧ x ∈ B) → x ∈ C.

Now suppose that this expression is true for all x and consider some x in
A ∩ B. Then x ∈ A ∧ x ∈ B is true. It follows that x ∈ C is also true. Since
this holds for every element x of A ∩ B, we have A ∩ B ⊆ C.


F.4.2 Modularity (20 points) 


Let m be an integer greater than or equal to 2. For each a in $\mathbb{Z}_m$, let
$f_a : \mathbb{Z}_m \to \mathbb{Z}_m$ be the function defined by the rule $f_a(x) = ax$.
Show that $f_a$ is a bijection if and only if gcd(a, m) = 1.


Solution 


From the extended Euclidean algorithm we have that if gcd(a, m) = 1, then
there exists a multiplicative inverse $a^{-1}$ such that $a^{-1}ax = x \pmod{m}$ for
all x in $\mathbb{Z}_m$. It follows that $f_a$ has an inverse function $f_a^{-1}$, and is thus a
bijection.

Conversely, suppose gcd(a, m) = g ≠ 1. Then
$f_a(m/g) = am/g = m(a/g) = 0 = a \cdot 0 = f_a(0) \pmod{m}$ but $m/g \ne 0 \pmod{m}$ since
0 < m/g < m. It follows that $f_a$ is not injective and thus not a bijection.
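
The equivalence is easy to test exhaustively for small moduli. In this sketch (is_bijection is an illustrative name), a map on a finite set is a bijection exactly when its image has as many elements as the set itself.

```python
from math import gcd


def is_bijection(a, m):
    """True if x -> a*x mod m hits every element of Z_m."""
    return len({a * x % m for x in range(m)}) == m


m = 12
print([a for a in range(m) if is_bijection(a, m)])
print([a for a in range(m) if gcd(a, m) == 1])
```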


F.4.3 Coin flipping (20 points) 


Take a biased coin that comes up heads with probability p and flip it 2n 
times. 

What is the probability that at some time during this experiment two 
consecutive coin-flips come up both heads or both tails? 


Solution 


It’s easier to calculate the probability of the event that we never get two 
consecutive heads or tails, since in this case there are only two possible 
patterns of coin-flips: HTHT… or THTH…. Since each of these patterns
contains exactly n heads and n tails, each occurs with probability $p^n(1-p)^n$,
giving a total probability of $2p^n(1-p)^n$. The probability that neither
sequence occurs is then $1 - 2p^n(1-p)^n$.
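
For small n the answer $1 - 2p^n(1-p)^n$ can be compared against a direct sum over all $2^{2n}$ flip sequences. The sketch below uses exact rational arithmetic; prob_repeat is a made-up name and p = 1/3 is an arbitrary test value.

```python
from fractions import Fraction
from itertools import product


def prob_repeat(n, p):
    """Probability that 2n flips contain two equal consecutive outcomes."""
    total = Fraction(0)
    for flips in product("HT", repeat=2 * n):
        if any(a == b for a, b in zip(flips, flips[1:])):
            heads = flips.count("H")
            total += p ** heads * (1 - p) ** (2 * n - heads)
    return total


p = Fraction(1, 3)
for n in range(1, 4):
    print(n, prob_repeat(n, p), 1 - 2 * p ** n * (1 - p) ** n)
```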


F.4.4 A transitive graph (20 points)


Let G be a graph with n vertices on which the adjacency relation is transitive: 
whenever there is an edge uv and an edge vw, there is also an edge uw. 
Suppose further that G is connected. How many edges does G have? 


Solution 


The graph G has exactly $\binom{n}{2}$ edges. The reason is that under the stated
conditions, G is a complete graph.

Consider any two vertices u and v. Because G is connected, there is a
path $u = v_0 v_1 \ldots v_k = v$ starting at u and ending at v. We can easily prove
by induction that there is an edge $u v_i$ for each 1 ≤ i ≤ k. The existence of
the first such edge is immediate from its presence in the path. For later edges,
we have from the induction hypothesis that there is an edge $u v_i$, from the
path that there is an edge $v_i v_{i+1}$, and thus from the transitivity condition
that there is an edge $u v_{i+1}$. When i = k, we have that there is an edge uv.


F.4.5 A possible matrix identity (20 points)


Prove or disprove: If A and B are symmetric matrices of the same dimension, 
then $A^2 - B^2 = (A - B)(A + B)$.


Solution 


Observe first that $(A - B)(A + B) = A^2 + AB - BA - B^2$. The question
then is whether AB = BA. Because A and B are symmetric, we have that
$BA = B^\top A^\top = (AB)^\top$. So if we can show that AB is also symmetric, then
we have $AB = (AB)^\top = BA$. Alternatively, if we can find symmetric matrices
A and B such that AB is not symmetric, then $A^2 - B^2 \ne (A - B)(A + B)$.

Let’s try multiplying two generic symmetric 2-by-2 matrices:

\[ \begin{pmatrix} a & b \\ b & c \end{pmatrix} \begin{pmatrix} d & e \\ e & f \end{pmatrix} = \begin{pmatrix} ad + be & ae + bf \\ bd + ce & be + cf \end{pmatrix} \]

The product doesn’t look very symmetric, and in fact we can assign
variables to make it not so. We need ae + bf ≠ bd + ce. Let’s set b = 0 to
make the bf and bd terms drop out, and e = 1 to leave just a and c. Setting
a = 0 and c = 1 gives an asymmetric product. Note that we didn’t determine
d or f, so let’s just set them to zero as well to make things as simple as
possible. The result is:

\[ AB = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}, \]

which is clearly not symmetric. So for these particular matrices we have
$A^2 - B^2 \ne (A - B)(A + B)$, disproving the claim.
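
The counterexample can be verified by multiplying everything out. The sketch below uses the assignment chosen in the solution (a = 0, b = 0, c = 1, d = 0, e = 1, f = 0); the helper names matmul and combine are illustrative.

```python
def matmul(X, Y):
    """Product of two 2-by-2 matrices given as nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]


def combine(X, Y, sign):
    """Entrywise X + sign * Y."""
    return [[X[i][j] + sign * Y[i][j] for j in range(2)] for i in range(2)]


A = [[0, 0], [0, 1]]   # a = 0, b = 0, c = 1
B = [[0, 1], [1, 0]]   # d = 0, e = 1, f = 0

lhs = combine(matmul(A, A), matmul(B, B), -1)          # A^2 - B^2
rhs = matmul(combine(A, B, -1), combine(A, B, +1))     # (A - B)(A + B)
print(matmul(A, B), lhs, rhs)
```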


F.5 CS202 Final Exam, December 14th, 2010 


Write your answers in the blue book(s). Justify your answers. Give closed-form
solutions when possible. Work alone. Do not use any notes or books.

There are five problems on this exam, each worth 20 points, for a total
of 100 points. You have approximately three hours to complete this exam.


F.5.1 Backwards and forwards (20 points) 


Let $\{0,1\}^n$ be the set of all binary strings $x_1 x_2 \ldots x_n$ of length n.

For any string x in $\{0,1\}^n$, let $r(x) = x_n x_{n-1} \ldots x_1$ be the reversal of x.
Let x ∼ y if x = y or x = r(y).

Given a string x in $\{0,1\}^n$ and a permutation π of {1, …, n}, let π(x)
be the string $x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)}$. Let x ≈ y if there exists some π such
that x = π(y).

Both ∼ and ≈ are equivalence relations. Let $\{0,1\}^n/{\sim}$ and $\{0,1\}^n/{\approx}$
be the corresponding sets of equivalence classes.


1. What is $|\{0,1\}^n/{\sim}|$ as a function of n?

2. What is $|\{0,1\}^n/{\approx}|$ as a function of n?


Solution 


1. Given a string x, the equivalence class [x] = {x, r(x)} has either one
element (if x = r(x)) or two elements (if x ≠ r(x)). Let $m_1$ be the
number of one-element classes and $m_2$ the number of two-element
classes. Then $|\{0,1\}^n| = 2^n = m_1 + 2m_2$ and the number we are
looking for is $m_1 + m_2 = \frac{2m_1 + 2m_2}{2} = \frac{2^n + m_1}{2} = 2^{n-1} + \frac{m_1}{2}$. To find
$m_1$, we must count the number of strings $x_1 \ldots x_n$ with $x_1 = x_n$,
$x_2 = x_{n-1}$, etc. If n is even, there are exactly $2^{n/2}$ such strings, since
we can specify one by giving the first n/2 bits (which determine the
rest uniquely). If n is odd, there are exactly $2^{(n+1)/2}$ such strings, since
the middle bit can be set freely. We can write both alternatives as
$m_1 = 2^{\lceil n/2 \rceil}$, giving $|\{0,1\}^n/{\sim}| = 2^{n-1} + 2^{\lceil n/2 \rceil - 1}$.


2. In this case, observe that x ≈ y if and only if x and y contain the same
number of 1 bits. There are n + 1 different possible values 0, 1, …, n
for this number. So $|\{0,1\}^n/{\approx}| = n + 1$.
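
Both counts are easy to confirm by brute force for small n: pick a canonical representative for each class and count the distinct representatives. This is a sketch with made-up helper names; the smaller of a string and its reversal represents a ∼-class, and the sorted string represents a ≈-class.

```python
from itertools import product


def count_classes(n, canonical):
    """Number of distinct canonical representatives among all n-bit strings."""
    return len({canonical(bits) for bits in product((0, 1), repeat=n)})


def up_to_reversal(bits):
    """Representative of the class {x, r(x)}."""
    return min(bits, bits[::-1])


def up_to_permutation(bits):
    """Representative of the class of all rearrangements of x."""
    return tuple(sorted(bits))


for n in range(1, 7):
    print(n, count_classes(n, up_to_reversal), count_classes(n, up_to_permutation))
```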


F.5.2 Linear transformations (20 points) 


Show whether each of the following functions from $\mathbb{R}^2$ to ℝ is a linear
transformation or not.

\[ f_1(x) = x_1 - x_2. \]
\[ f_2(x) = x_1 x_2. \]
\[ f_3(x) = x_1 + x_2 + 1. \]
\[ f_4(x) = \frac{x_1^2 - x_2^2 + x_1 - x_2}{x_1 + x_2 + 1}. \]

Clarification added during the exam: You may assume that $x_1 + x_2 \ne -1$
for $f_4$.


Solution


1. Linear: $f_1(ax) = ax_1 - ax_2 = a(x_1 - x_2) = a f_1(x)$ and
$f_1(x + y) = (x_1 + y_1) - (x_2 + y_2) = (x_1 - x_2) + (y_1 - y_2) = f_1(x) + f_1(y)$.

2. Not linear: $f_2(2x) = (2x_1)(2x_2) = 4x_1 x_2 = 4f_2(x) \ne 2f_2(x)$ when
$f_2(x) \ne 0$.

3. Not linear: $f_3(2x) = 2x_1 + 2x_2 + 1$ but $2f_3(x) = 2x_1 + 2x_2 + 2$. These
are never equal.

4. Linear:

\begin{align*}
f_4(x) &= \frac{x_1^2 - x_2^2 + x_1 - x_2}{x_1 + x_2 + 1} \\
&= \frac{(x_1 + x_2)(x_1 - x_2) + (x_1 - x_2)}{x_1 + x_2 + 1} \\
&= \frac{(x_1 + x_2 + 1)(x_1 - x_2)}{x_1 + x_2 + 1} \\
&= x_1 - x_2 \\
&= f_1(x).
\end{align*}

Since we’ve already shown $f_1$ is linear, $f_4 = f_1$ is also linear.

A better answer is that $f_4$ is not a linear transformation from $\mathbb{R}^2$ to
ℝ because it’s not defined when $x_1 + x_2 + 1 = 0$. The clarification added
during the exam tries to work around this, but doesn’t really work. A
better clarification would have defined $f_4$ as above for most x, but have
$f_4(x) = x_1 - x_2$ when $x_1 + x_2 = -1$. Since I was being foolish about this
myself, I gave full credit for any solution that either did the division or
noticed the dividing-by-zero issue.


F.5.3 Flipping coins (20 points)


Flip n independent fair coins, and let X be a random variable that counts
how many of the coins come up heads. Let a be a constant. What is $E[a^X]$?


Solution 


To compute $E[a^X]$, we need to sum over all possible values of $a^X$ weighted
by their probabilities. The variable X itself takes on each value k ∈ {0 … n}
with probability $\binom{n}{k} 2^{-n}$, so $a^X$ takes on each corresponding value $a^k$ with
the same probability. We thus have:

\begin{align*}
E[a^X] &= \sum_{k=0}^{n} \binom{n}{k} 2^{-n} a^k \\
&= 2^{-n} \sum_{k=0}^{n} \binom{n}{k} a^k 1^{n-k} \\
&= 2^{-n} (a+1)^n \\
&= \left(\frac{a+1}{2}\right)^n.
\end{align*}

The second to last step uses the Binomial Theorem.

As a quick check, some easy cases are a = 0 with $E[a^X] = (1/2)^n$,
which is consistent with the fact that $a^X = 1$ if and only if X = 0; and
a = 1 with $E[a^X] = 1^n = 1$, which is consistent with $a^X = 1^X = 1$ being
constant. Another easy case is n = 1, in which we can compute directly
$E[a^X] = (1/2)a^0 + (1/2)a^1 = \frac{a+1}{2}$ as given by the formula. So we can have
some confidence that we didn’t mess up in the algebra somewhere.

Note also that $E[a^X] = \left(\frac{a+1}{2}\right)^n$ is generally not the same as $a^{E[X]} = a^{n/2}$.
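
The closed form $\left(\frac{a+1}{2}\right)^n$ can also be compared against the defining sum using exact arithmetic. The sketch below is illustrative (expected_a_to_X and formula are made-up names, and the test values of a are arbitrary):

```python
from fractions import Fraction
from math import comb


def expected_a_to_X(a, n):
    """E[a^X] for X = number of heads in n fair flips, from the definition."""
    return sum(Fraction(comb(n, k), 2 ** n) * a ** k for k in range(n + 1))


def formula(a, n):
    """The closed form ((a + 1) / 2)^n."""
    return ((Fraction(a) + 1) / 2) ** n


for a in (0, 1, 3, Fraction(-2, 5)):
    print(a, expected_a_to_X(a, 4), formula(a, 4))
```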


F.5.4 Subtracting dice (20 points) 


Let X and Y represent independent 6-sided dice, and let Z = |X − Y| be
their difference. (For example, if X = 4 and Y = 3, then Z = 1, and similarly 
when X = 2 and Y =5, then Z = 3.) 


1. What is Pr[Z = 1]? 
2. What is E[Z]? 
3. What is E[Z | Z ≠ 0]?


Solution 


1. There are five cases where Z = 1 with Y = X + 1 (because X can
range from 1 to 5), and five more cases where Z = 1 with X = Y + 1.
So Pr[Z = 1] = 10/36 = 5/18.

2. Here we count 10 cases where Z = 1, 8 cases where Z = 2 (using
essentially the same argument as above; here the lower die can range
up to 4), 6 where Z = 3, 4 where Z = 4, and 2 where Z = 5.
The cases where Z = 0 we don’t care about. Summing up, we get
E[Z] = (10 · 1 + 8 · 2 + 6 · 3 + 4 · 4 + 2 · 5)/36 = 70/36 = 35/18.

3. We can avoid recomputing all the cases by observing that E[Z] =
E[Z | Z ≠ 0] Pr[Z ≠ 0] + E[Z | Z = 0] Pr[Z = 0]. Since E[Z | Z = 0] = 0,
the second term disappears and we can solve for E[Z | Z ≠ 0] =
E[Z]/Pr[Z ≠ 0]. We can easily calculate Pr[Z = 0] = 1/6 (since both
dice are equal in this case, giving 6 out of 36 possible rolls), from
which we get Pr[Z ≠ 0] = 1 − Pr[Z = 0] = 5/6. Plugging this into our
previous formula gives E[Z | Z ≠ 0] = (35/18)/(5/6) = 7/3.


It is also possible (and acceptable) to solve this problem by building a 
table of all 36 cases and summing up the appropriate values. 
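
The table-based approach is a few lines of code. The sketch below enumerates all 36 rolls with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))     # all 36 equally likely (X, Y) pairs
z_values = [abs(x - y) for x, y in rolls]

pr_z1 = Fraction(z_values.count(1), len(rolls))
e_z = Fraction(sum(z_values), len(rolls))

nonzero = [z for z in z_values if z != 0]        # condition on Z != 0
e_z_given_nonzero = Fraction(sum(nonzero), len(nonzero))

print(pr_z1, e_z, e_z_given_nonzero)
```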


F.5.5 Scanning an array (20 points)


Suppose you have an m × m array in some programming language, that is, a
data structure A holding a value A[i, j] for each 0 ≤ i < m and 0 ≤ j < m.
You’d like to write a program that sets every element of the array to zero.

The usual way to do this is to start with i = 0 and j = 0, increment j
until it reaches m, then start over with i = 1 and j = 0, and repeat until all
$m^2$ elements of A have been reached. But this requires two counters. Instead,
a clever programmer suggests using one counter k that runs from 0 up to
$m^2 - 1$, and at each iteration setting A[3k mod m, 7k mod m] to zero.

For what values of m > 0 does this approach actually reach all $m^2$
locations in the array?


Solution 


Any two inputs k that are equal mod m give the same pair (3k mod m, 7k mod 
m). So no matter how many iterations we do, we only reach m distinct 
locations. This equals $m^2$ only if m = 1 or m = 0. The problem statement
excludes m = 0, so we are left with m = 1 as the only value of m for which 
this method works. 
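
A quick simulation makes the failure concrete. The sketch below (cells_reached is an illustrative name) runs the single-counter loop and counts how many distinct array positions it touches.

```python
def cells_reached(m):
    """Distinct positions (3k mod m, 7k mod m) visited for k = 0 .. m*m - 1."""
    return len({(3 * k % m, 7 * k % m) for k in range(m * m)})


for m in range(1, 9):
    print(m, cells_reached(m), m * m)
```

For every m tried, the loop visits m positions rather than $m^2$, so the two columns agree only at m = 1.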


Appendix G 


How to write mathematics 


Suppose you want to write down some mathematics. How do you do it? 


G.1 By hand 
This method is no longer recommended for CPSC 202 assignments. 


Advantages Don’t need to learn any special formatting tools: any symbol 
you can see you can copy. Very hard to make typographical errors. 


Disadvantages Not so good for publishing. Results may be ugly if you have
bad handwriting. Results may be even worse if you copy somebody
else’s bad handwriting. Requires a scanner or camera to turn into
PDF.


Example 


[Image: the formula $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$ written by hand.]


G.2 LaTeX


This is what these notes are written in. It’s also standard for writing papers 
in most technical fields. 


Advantages Very nice formatting. De facto standard for mathematics 
publishing. Free. Trivial to convert to PDF. 


Disadvantages You have to install it and learn it. Can’t tell what something
looks like until you run it through a program. Cryptic and
uninformative 1970s-era error messages. The underlying system TeX
is possibly the worst programming language in widespread use.


Example


∑_{i=1}^{n} i = n(n+1)/2.


The text above was generated by this source code:


\begin{displaymath} 
\sum_{i=1}^n i = \frac{n(n+1)}{2}.
\end{displaymath} 


although a real LaTeX document would also include some boilerplate
around it. 


LaTeX runs on the computers in the Zoo, and can be made to run on
just about anything. There is a pretty good introduction to LaTeX at
https://en.wikibooks.org/wiki/LaTeX.

The general rule of thumb for typesetting mathematics in LaTeX is that
everything is represented in ASCII, with math typically delimited by dollar
signs, special symbols represented by operators preceded by backslashes, and
arguments grouped using curly braces. The \begin and \end operators are
used for larger structures, much like opening and closing tags in HTML. An
example of a complete LaTeX document that uses a few of the fancier features
is given in Figure G.1. The formatted version appears in Figure G.2.

There are front-ends to LaTeX like LyX (http://www.lyx.org) that try to
make it WYSIWYG, with varying results. I don’t use any of them myself.


G.3 Microsoft Word equation editor 
This is probably a bad habit to get into. 


Advantages There’s a good chance you already type things up in Word. 


Disadvantages Ugly formatting. Unpopular with conferences and journals, 
if you end up in the paper-writing business. 


I don’t use Word much, so no example. 


\documentclass[12pt]{article}

% what kind of programming language doesn’t let you 
% put in comments and define new commands? 
\newcommand{\twoTimes}[1]{2 \cdot {#1}} 

% sometimes it is useful to import packages 
\usepackage{amsmath} 

\usepackage{fullpage} 

\begin{document} 

\section{Introduction} 

This is a document written in \LaTeX{}. 


Each paragraph starts with a new line. 


\section{Contents} 
\label{section-contents} 


It is well known that 
the \textbf{inverse Ackermann function} $\alpha(n)$ is $O(\log n)$ 
and that $O(n \log n)$ is $O(n^{1+\epsilon})$ for any $\epsilon > 0$. 
\begin{equation} 
\twoTimes{4} = 8 
\label{eq-two-times} 
\end{equation} 
I am sure \eqref{eq-two-times} is true, but this is \emph{not} a proof. 


\section{Conclusion} 


Look at all the great stuff we said in \S\ref{section-contents}! 
\end{document} 


Figure G.1: Source code for sample LaTeX document. 


1 Introduction 

This is a document written in LaTeX. 
Each paragraph starts with a new line. 

2 Contents 


It is well known that the inverse Ackermann function α(n) is O(log n) and that O(n log n) 
is O(n^{1+ε}) for any ε > 0. 


2 · 4 = 8 (1) 


I am sure (1) is true, but this is not a proof. 


3 Conclusion 


Look at all the great stuff we said in §2! 


Figure G.2: Formatted sample LaTeX document. 


G.4 Google Docs equation editor 
Pretty similar to the equation editor in Microsoft Word. 


Advantages Accessible for free from anywhere on the web. Easy export to 
PDF. 


Disadvantages Formatting not much better than Microsoft Word. 


G.5 ASCII and/or Unicode art 


This is the method originally used to format these notes, back when they lived 
at http://pine.cs.yale.edu/pinewiki/CS202. I also use it for putting 
equations in my own personal research notes, which are otherwise stored in 
flat ASCII files. 


Advantages Everybody can read ASCII and most people can read Unicode. 
No special formatting required. Results are mostly machine-readable. 


Disadvantages Very ugly formatting. Writing Unicode on a computer is a 
bit like writing Chinese—you need to learn how to input each possible 
character using whatever system you’ve got. May render strangely on 
some browsers. No easy way to convert to other formats like LaTeX or 
PDF. 


Example sum[i=1 to n] i = n(n+1)/2 (ASCII). Or a fancy version: 


 n 
--- 
\       n(n+1) 
/   i = ------ 
---       2 
i=1 


Amazingly enough, many mathematics papers from the typewriter era 
(pre-1980 or thereabouts) were written like this, often with the more obscure 
symbols inked in by hand. Fortunately (for readers at least), we don’t have 
to do this any more. 


G.6 Markdown 
A compromise between ASCII and formatting languages like LaTeX. 


Advantages Looks more like normal text than LaTeX. Many tools exist for 
converting to other formats. Used by many web platforms. 


Disadvantages No special notation for mathematics (though some tools 
like pandoc allow embedded LaTeX). Many variant syntaxes. 


Example 


* This is an itemized list with a *lot* of **shouting**. 
* It uses the formatting conventions expected by [`pandoc`](http://pandoc.org). 
* It even includes an embedded LaTeX formula: $x^2+y^2=z^2$. 


Formatted version of example 


• This is an itemized list with a lot of shouting. 


• It uses the formatting conventions expected by pandoc. 


• It even includes an embedded LaTeX formula: x² + y² = z². 


Appendix H 


Tools from calculus 


Calculus is not a prerequisite for this course, and it is possible to have a 
perfectly happy career as a computer scientist without learning any calculus 
at all. But for some tasks, calculus is much too useful a tool to ignore. 
Fortunately, even though typical high-school calculus courses run a full 
academic year, the good parts can be understood with a few hours of 
practice. 


H.1 Limits 


The fundamental tool used in calculus is the idea of a limit. This is an 
approximation by nearby values to the value of an expression that we can’t 
calculate exactly, typically because it involves division by zero. 

The formal definition is that the limit as $x$ goes to $a$ of $f(x)$ is $c$, written 

\[ \lim_{x \to a} f(x) = c, \] 

if for any constant $\epsilon > 0$ there exists a constant $\delta > 0$ such that 

\[ |f(x) - c| < \epsilon \] 

whenever 

\[ 0 < |x - a| < \delta. \] 

The intuition is that as $x$ gets closer to $a$, $f(x)$ gets closer to $c$. 

The formal definition has three layers of quantifiers, so as with all 
quantified expressions it helps to think of it as a game played between you and 
some adversary that controls all the universal quantifiers. So to show that 
$\lim_{x \to a} f(x) = c$, we have three steps: 


• Some malevolent jackass picks $\epsilon$, and says “oh yeah, smart guy, I bet 
you can’t force $f(x)$ to be within $\epsilon$ of $c$.” 


• After looking at $\epsilon$, you respond with $\delta$, limiting the possible 
values of $x$ to the range $[a - \delta, a + \delta]$. 


• Your opponent wins if he can find an $x \ne a$ in this range with $f(x)$ 
outside $[c - \epsilon, c + \epsilon]$. Otherwise you win. 


For example, in the next section we will want to show that 

\[ \lim_{z \to 0} \frac{(x+z)^2 - x^2}{z} = 2x. \] 

We need to take a limit here because the left-hand side isn’t defined when 
$z = 0$. 

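Before playing the game, it helps to use algebra to rewrite the left-hand 
side a bit: 

\begin{align*} 
\lim_{z \to 0} \frac{(x+z)^2 - x^2}{z} 
&= \lim_{z \to 0} \frac{x^2 + 2xz + z^2 - x^2}{z} \\ 
&= \lim_{z \to 0} \frac{2xz + z^2}{z} \\ 
&= \lim_{z \to 0} (2x + z). 
\end{align*} 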


So now the adversary says “make $|(2x + z) - 2x| < \epsilon$,” and we say “that’s 
easy, let $\delta = \epsilon$, then no matter what $z$ you pick, as long as 
$|z - 0| < \delta$, we get $|(2x + z) - 2x| = |z| < \delta = \epsilon$, QED.” And 
the adversary slinks off with 
its tail between its legs to plot some terrible future revenge. 
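
The game can also be played mechanically. The sketch below is a Python illustration that is not part of the original notes: the names `difference_quotient` and `adversary_loses` are made up for this example, and the adversary is simulated by random sampling, so a `True` result is evidence rather than a proof. It uses exact rational arithmetic so that rounding error does not blur the comparison with $\epsilon$.

```python
from fractions import Fraction
import random

def difference_quotient(x, z):
    """((x + z)^2 - x^2) / z, which is undefined at z = 0."""
    return ((x + z) ** 2 - x ** 2) / z

def adversary_loses(x, epsilon, trials=1000, seed=0):
    """Answer epsilon with delta = epsilon and let a random adversary try to win.

    Returns True if every sampled z with 0 < |z| < delta keeps the quotient
    strictly within epsilon of the claimed limit 2x.
    """
    rng = random.Random(seed)
    delta = epsilon                        # our move in the game
    for _ in range(trials):
        # A rational z with 0 < |z| < delta (hypothetical sampling scheme).
        z = rng.choice([-1, 1]) * delta * Fraction(rng.randint(1, 999), 1000)
        if abs(difference_quotient(x, z) - 2 * x) >= epsilon:
            return False                   # the adversary found a bad z
    return True

print(adversary_loses(Fraction(3), Fraction(1, 100)))
```

Because the quotient simplifies to exactly $2x + z$, the check reduces to $|z| < \delta = \epsilon$, which is the same observation the algebraic argument relies on.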

Of course, a definition only really makes sense if it doesn’t work if we 
pick a different limit. If we try to show 

\[ \lim_{z \to 0} \frac{(x+z)^2 - x^2}{z} = 12 \] 

(assuming $x \ne 6$), then the adversary picks some positive 
$\epsilon < |12 - 2x|/2$. Now we are out of luck: no matter what $\delta$ we 
pick, the adversary can respond with some value very close to 0 (say, 
$z = \min(\delta/2, |12 - 2x|/2)$), and we land inside $\pm\delta$ but 
outside $12 \pm \epsilon$, because $2x + z$ is still at least 
$|12 - 2x|/2 > \epsilon$ away from 12. 

We can also take the limit as a variable goes to infinity. This has a 
slightly different definition: 

\[ \lim_{x \to \infty} f(x) = c \] 

holds if for any $\epsilon > 0$, there exists an $N > 0$, such that for all 
$x > N$, $|f(x) - c| < \epsilon$. Structurally, this is the same 3-step game as 
before, except now after we see $\epsilon$, instead of making $x$ very close to 
$a$, we make $x$ very big. Limits as $x$ goes to infinity are sometimes handy 
for evaluating asymptotic notation. 

Limits don’t always exist. For example, if we try to take 

\[ \lim_{x \to \infty} x^2, \] 

then there is no value $c$ that $x^2$ eventually approaches. In this particular 
case, we can say that $\lim_{x \to \infty} x^2$ diverges to infinity, which 
means that for any $m$, there is an $N$ such that $f(x) > m$ for all $x > N$. 

Other limits may not diverge to infinity but still may not exist. An 
example would be 

\[ \lim_{n \to \infty} (-1)^n. \] 

Since this oscillates between $-1$ and $+1$ at integer values of $n$ (and does 
horrible things in the complex plane at other values), there is no particular 
value $c$ that it ever approaches. 


H.2 Derivatives 


The derivative or differential of a function measures how much the function 
changes if we make a very small change to its input. One way to think about 
this is that for most functions, if you blow up a plot of them enough, you don’t 
see any curvature any more, and the function looks like a line that we can 
approximate as ax + b for some coefficients a and b, and the derivative gives 
the slope a. This is useful for determining whether a function is increasing 
or decreasing in some interval, and for finding things like local minima or 
maxima. 

The derivative $f'(x)$ gives the coefficient $a$ for each particular $x$. The 
notation $f'$ is due to Lagrange and is convenient for functions that have names 
but not so convenient for something like $x^2 + 3$. For more general functions, 
a different notation due to Leibniz is used. The derivative of $f$ with respect 
to $x$ is written as $\frac{df}{dx}$ or $\frac{d}{dx} f$, and its value for a 
particular value $x = c$ is written using the somewhat horrendous notation 

\[ \left. \frac{d}{dx} f \right|_{x=c}. \] 

$f(x)$            $f'(x)$ 
$c$               $0$ 
$x^n$             $n x^{n-1}$ 
$e^x$             $e^x$ 
$a^x$             $a^x \ln a$                follows from $a^x = e^{x \ln a}$ 
$\ln x$           $1/x$ 
$c g(x)$          $c g'(x)$                  multiplication by a constant 
$g(x) + h(x)$     $g'(x) + h'(x)$            sum rule 
$g(x) h(x)$       $g(x) h'(x) + g'(x) h(x)$  product rule 
$g(h(x))$         $g'(h(x)) h'(x)$           chain rule 


Table H.1: Table of derivatives 


There is a formal definition of $f'(x)$, which nobody ever uses, given by 

\[ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}, \] 

where $\Delta x$ is a single two-letter variable (not the product of $\Delta$ 
and $x$!) that represents the change in $x$. In the preceding section, we 
calculated an example of this kind of limit and showed that 
$\frac{d}{dx} x^2 = 2x$. 

Using the formal definition to calculate derivatives is painful and almost 
always unnecessary. The reason is that the derivatives for most standard 
functions are well-known, and by memorizing a few simple rules you can 
combine these to compute the derivative of just about anything completely 
mechanically. A handy cheat sheet is given in Table H.1. 


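Example: 

\begin{align*} 
\frac{d}{dx} \frac{x^2}{\ln x} 
&= x^2 \cdot \frac{d}{dx} \frac{1}{\ln x} + \frac{1}{\ln x} \cdot \frac{d}{dx} x^2 && \text{[product rule]} \\ 
&= x^2 \cdot (-1) \cdot (\ln x)^{-2} \cdot \frac{d}{dx} \ln x + \frac{1}{\ln x} \cdot 2x && \text{[chain rule]} \\ 
&= -\frac{x^2}{\ln^2 x} \cdot \frac{1}{x} + \frac{2x}{\ln x} \\ 
&= -\frac{x}{\ln^2 x} + \frac{2x}{\ln x}. 
\end{align*} 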

The idea is that whatever the outermost operation in an expression is, 
you can apply one of the rules above to move the differential inside it, until 
there is nothing left. Even computers can be programmed to do this. You 
can do it too. 
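
As a rough illustration of how mechanical this is, here is a minimal Python sketch of a differentiator that applies the rules from the table of derivatives from the outside in. The tuple encoding of expressions and the names `diff`, `evaluate`, and `example` are assumptions made for this sketch and do not come from the notes. It makes no attempt to simplify its output.

```python
import math

# Expressions are nested tuples (an encoding assumed for this sketch):
#   ("const", c)   ("var",)   ("add", u, v)   ("mul", u, v)
#   ("pow", u, n) with a constant exponent n,   ("ln", u)

def diff(e):
    """Differentiate e with respect to x by pushing d/dx inward."""
    kind = e[0]
    if kind == "const":
        return ("const", 0)
    if kind == "var":
        return ("const", 1)
    if kind == "add":                        # sum rule
        return ("add", diff(e[1]), diff(e[2]))
    if kind == "mul":                        # product rule
        u, v = e[1], e[2]
        return ("add", ("mul", u, diff(v)), ("mul", diff(u), v))
    if kind == "pow":                        # power rule combined with chain rule
        u, n = e[1], e[2]
        return ("mul", ("mul", ("const", n), ("pow", u, n - 1)), diff(u))
    if kind == "ln":                         # (ln u)' = (1/u) u', by the chain rule
        u = e[1]
        return ("mul", ("pow", u, -1), diff(u))
    raise ValueError(f"unknown expression: {e!r}")

def evaluate(e, x):
    """Evaluate expression e at the number x."""
    kind = e[0]
    if kind == "const":
        return e[1]
    if kind == "var":
        return x
    if kind == "add":
        return evaluate(e[1], x) + evaluate(e[2], x)
    if kind == "mul":
        return evaluate(e[1], x) * evaluate(e[2], x)
    if kind == "pow":
        return evaluate(e[1], x) ** e[2]
    if kind == "ln":
        return math.log(evaluate(e[1], x))
    raise ValueError(f"unknown expression: {e!r}")

X = ("var",)
# x^2 / ln x, written as x^2 * (ln x)^(-1)
example = ("mul", ("pow", X, 2), ("pow", ("ln", X), -1))
derivative = diff(example)
```

Evaluating `derivative` at a point such as 3.0 and comparing with $2x/\ln x - x/\ln^2 x$ is a quick sanity check on both the program and the hand calculation.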


H.3 Integrals 


First you have to know how to find derivatives (see previous section). Having 
learned how to find derivatives, your goal in integrating some function $f(x)$ 
is to find another function $F(x)$ such that $F'(x) = f(x)$. You can then write 
that the indefinite integral $\int f(x)\,dx$ of $f(x)$ is $F(x) + C$ (any 
constant $C$ works), and compute definite integrals with the rule 

\[ \int_a^b f(x)\,dx = F(b) - F(a). \] 


Though we’ve mostly described integrals as anti-derivatives, there is a 
physical interpretation: $\int_a^b f(x)\,dx$ gives the area of the region under 
the $f(x)$ curve between $a$ and $b$ (with negative values contributing negative 
area). In this view, an integral acts very much like a sum, and indeed the 
integral of many well-behaved functions¹ can be computed using the formula 

\[ \int_a^b f(x)\,dx = \lim_{\Delta x \to 0} \sum_{i=0}^{\lfloor (b-a)/\Delta x \rfloor} f(a + i \Delta x)\,\Delta x. \tag{H.3.1} \] 

Alternatively, one can also think of the definite integral $\int_a^b f(x)\,dx$ 
as a special case of the indefinite integral $\int f(x)\,dx = F(x) + C$ where we 
choose $C = -F(a)$ so that $F(a) + C = 0$. In this case, 
$F(b) + C = F(b) - F(a) = \int_a^b f(x)\,dx$. Where this interpretation differs 
from “area under the curve” is 
that it works even if $b < a$. 
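
To see the sum interpretation and the anti-derivative rule agree, one can evaluate the sum in (H.3.1) for a small but fixed step and compare it with $F(b) - F(a)$. The Python sketch below does this for $f(x) = x^2$; the function names and the choice of step size are assumptions made for the example, and a fixed step only approximates the limit.

```python
import math

def riemann_sum(f, a, b, dx):
    """Sum f(a + i*dx) * dx for i = 0 .. floor((b - a) / dx)."""
    steps = math.floor((b - a) / dx)
    return sum(f(a + i * dx) * dx for i in range(steps + 1))

def f(x):
    return x * x

def F(x):
    """An anti-derivative of f: F'(x) = x^2."""
    return x ** 3 / 3

exact = F(2.0) - F(0.0)                    # area under x^2 between 0 and 2
approx = riemann_sum(f, 0.0, 2.0, 0.001)   # finite-step version of the sum
print(exact, approx)
```

For a fixed step the sum includes one term at each end of the interval, so it is off by roughly one term's worth of area; that discrepancy shrinks as the step goes to 0.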

Returning to anti-differentiation, how do you find a magic $F(x)$ with 
$F'(x) = f(x)$? Some possibilities: 


• Memorize some standard integral formulas. Some useful ones are given 
in Table H.2. 


• Guess but verify. Guess $F(x)$ and compute $F'(x)$ to see if it’s $f(x)$. 
May be time-consuming unless you are good at guessing, and can 
put enough parameters in $F(x)$ to let you adjust $F'(x)$ to equal $f(x)$. 
Example: if $f(x) = 2/x$, you may remember the $1/x$ formula and 
try $F(x) = a \ln bx$. Then $F'(x) = ab/(bx) = a/x$ and you can set 
$a = 2$, quietly forget you ever put in $b$, and astound your friends (who 
also forgot the $af(x)$ rule) by announcing that the integral is $2 \ln x$. 
Sometimes if the answer comes out wrong you can see how to fudge 
$F(x)$ to make it work: if for $f(x) = \ln x$ you guess $F(x) = x \ln x$, then 
$F'(x) = \ln x + 1$ and you can notice that you need to add a $-x$ term 
(the integral of $-1$) to get rid of the 1. This gives 
$\int \ln x\,dx = x \ln x - x$. 


¹One way to be a well-behaved function is to have a bounded derivative over 
$[a, b]$. This will make (H.3.1) work, in the sense of giving sensible results 
that are consistent with more rigorous definitions of integrals. 

An example of a non-well-behaved function is the non-differentiable function 
$f$ with $f(x) = 1$ if $x$ is rational and $f(x) = 0$ if $x$ is irrational. This 
is almost never 1, but (H.3.1) may give strange results when $\Delta x$ is 
chosen so that $f(a + i \Delta x)$ hits a lot of rationals. More 
sophisticated definitions of integrals, like the Lebesgue integral, give more 
reasonable answers here. 


$f(x)$            $F(x)$ 
$f(x) + g(x)$     $F(x) + G(x)$ 
$a f(x)$          $a F(x)$              $a$ is constant 
$f(ax)$           $F(ax)/a$             $a$ is constant 
$x^n$             $x^{n+1}/(n+1)$       $n$ constant, $n \ne -1$ 
$x^{-1}$          $\ln x$ 
$e^x$             $e^x$ 
$a^x$             $a^x / \ln a$         $a$ constant 
$\ln x$           $x \ln x - x$ 


Table H.2: Table of integrals 


• There’s a technique called integration by parts, which is the integral 
version of the $d(uv) = u\,dv + v\,du$ formula, but it doesn’t work as often 
as one might like. The rule is that 

\[ \int u\,dv = uv - \int v\,du. \] 

An example is $\int \ln x\,dx = x \ln x - \int x\,d(\ln x) = 
x \ln x - \int x (1/x)\,dx = x \ln x - \int 1\,dx = x \ln x - x$. You probably 
shouldn’t bother memorizing this unless you need to pass AP Calculus, although 
you can rederive it from the product rule for derivatives. 


• Use a computer algebra system like Mathematica, Maple, or Maxima. 
Mathematica’s integration routine is available on-line at 
http://integrals.wolfram.com. 


• Look your function up in a big book of integrals. This is generally less 
effective than using Mathematica, but may continue to work during 
power failures. 


Note that in each of these cases, once you’ve found one $F$ with $F'(x) = 
f(x)$, then any function of the form $F(x) + C$ (where $C$ is a constant) also 
works. One of the reasons for spending a year on high-school calculus is that 
it takes that long to train you to remember to always write your integrals 
as $F(x) + C$. Fortunately, as soon as one calculates a definite integral 
$\int_a^b f(x)\,dx = (F(b) + C) - (F(a) + C)$, the $C$’s cancel, so usually 
forgetting 
the constant will not cause too much trouble. 
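
The "guess but verify" approach is easy to automate in a rough way: differentiate the guess numerically and compare it with $f$ at a few sample points. The Python sketch below is only an illustration. The central-difference formula, the tolerance, the sample points, and all function names are assumptions of this example rather than anything prescribed by the notes, and agreement at a handful of points is evidence, not proof.

```python
import math

def numeric_derivative(F, x, h=1e-6):
    """Central-difference estimate of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

def looks_like_antiderivative(F, f, points, tol=1e-5):
    """Check that F'(x) is close to f(x) at every sample point."""
    return all(abs(numeric_derivative(F, x) - f(x)) < tol for x in points)

def first_guess(x):
    return x * math.log(x)          # derivative is ln x + 1, so this is off by 1

def fixed_guess(x):
    return x * math.log(x) - x      # subtracting x removes the stray 1

points = [0.5, 1.0, 2.0, 10.0]
print(looks_like_antiderivative(first_guess, math.log, points))
print(looks_like_antiderivative(fixed_guess, math.log, points))
```

Adding any constant to `fixed_guess` would pass the same check, which is the numerical counterpart of the $+ C$ in $F(x) + C$.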


Appendix I 


The natural numbers 


Here we give an example of how we can encode simple mathematics using 
predicate logic, and then prove theorems about the resulting structure. Our 
goal is to represent the natural numbers: 0, 1, 2, etc.¹ 


I.1 The Peano axioms 


The Peano axioms represent natural numbers using a special 0 constant 
and a function symbol $S$ (for “successor”; think of it as $+1$). Repeatedly 
applying $S$ to 0 generates increasingly large natural numbers: $S0 = 1$, 
$SS0 = 2$, $SSS0 = 3$, etc. (Note that 1, 2, 3, etc., are not part of the 
language, although we might use them sometimes as shorthand for long strings of 
$S$’s.) For convenience, we don’t bother writing parentheses for the function 
applications unless we need to do so to avoid ambiguity: read $SSSS0$ as 
$S(S(S(S(0))))$. 

The usual interpretation of function symbols implies that 0 and its 
successors exist, but it doesn’t guarantee that they aren’t all equal to each 
other. The first Peano axiom² prevents this: 

\[ \forall x : Sx \ne 0. \tag{P1} \] 


¹Some people define the natural numbers as starting at 1. Those people are generally 
(a) wrong, (b) number theorists, (c) extremely conservative, or (d) citizens of the United 
Kingdom of Great Britain and Northern Ireland. As computer scientists, we will count 
from 0 as the gods intended. 

²This is not actually the first axiom that Peano defined. The original Peano 
axioms [Pea89, §1] included some axioms on existence of $Sx$ and the properties of equality 
that have since been absorbed as standard rules of first-order logic. The axioms we are 
presenting here correspond to Peano’s axioms 8, 7, and 9. 


In English, 0 is not the successor of any number. 

This still allows for any number of nasty little models in which 0 is 
nobody’s successor, but we still stop before getting all of the naturals. For 
example, let $SS0 = S0$; then we only have two elements in our model (0 and 
$S0$), because once we get to $S0$, any further applications of $S$ keep us 
where we are. 

To avoid this, we need to prevent $S$ from looping back round to some 
number we’ve already produced. It can’t get to 0 because of the first axiom, 
and to prevent it from looping back to a later number, we take advantage of 
the fact that each of them is already the successor of one number: 

\[ \forall x : \forall y : Sx = Sy \rightarrow x = y. \tag{P2} \] 

If we take the contrapositive in the middle, we get 
$x \ne y \rightarrow Sx \ne Sy$. In other words, we can’t have a single number 
$z$ that is the successor of two different numbers $x$ and $y$. 

Now we get all of $\mathbb{N}$, but we may get some extra elements as well. There 
is nothing in the first two axioms that prevents us from having something 
like this: 

\[ 0 \rightarrow S0 \rightarrow SS0 \rightarrow SSS0 \rightarrow \cdots \qquad B \rightarrow SB \rightarrow SSB \rightarrow SSSB \rightarrow \cdots \] 

where $B$ stands for “bogus.” 

The hard part of coming up with the Peano axioms was to prevent the 
model from sneaking in extra bogus values (that still have successors and 
at most one predecessor each). This is (almost) done using the third Peano 
axiom, which in first-order logic (where we can’t write $\forall P$) is written 
as an axiom schema. This is a pattern that generates an infinite list of 
axioms, one for each choice of predicate $P$: 

\[ (P(0) \wedge (\forall x : P(x) \rightarrow P(Sx))) \rightarrow \forall x : P(x). \tag{P3} \] 


This is known as the induction schema, and says that, for any predicate 
$P$, if we can prove that $P$ holds for 0, and we can prove that $P(x)$ implies 
$P(x + 1)$, then $P$ holds for all $x$ in $\mathbb{N}$. The intuition is that 
even though we haven’t bothered to write out a proof of, say, $P(1337)$, we know 
that we can generate one by starting with $P(0)$ and modus-pwning our way out to 
$P(1337)$ using $P(0) \rightarrow P(1)$, then $P(1) \rightarrow P(2)$, then 
$P(2) \rightarrow P(3)$, etc. Since this works for any number (eventually), 
there can’t be some number that we missed. 

In particular, this lets us throw out the bogus numbers in the bad example 
above. Let $B(x)$ be true if $x$ is bogus (i.e., it’s equal to $B$ or one of the 
other values in its chain of successors). Let $P(x) \equiv \neg B(x)$. Then 
$P(0)$ holds (0 is not bogus), and if $P(x)$ holds ($x$ is not bogus) then so 
does $P(Sx)$. It follows from the induction axiom that $\forall x\, P(x)$: 
there are no bogus numbers.³ 


I.2 A simple proof 


Let’s use the Peano axioms to prove something that we know to be true about 
the natural numbers we learned about in grade school but that might not be 
obvious from the axioms themselves. (This will give us some confidence that 
the axioms are not bogus.) We want to show that 0 is the only number that 
is not a successor: 


Claim I.2.1. $\forall x : (x \ne 0) \rightarrow (\exists y : x = Sy)$. 


To find a proof of this, we start by looking at the structure of what we are 
trying to prove. It’s a universal statement about elements of $\mathbb{N}$ 
(implicitly, the $\forall x$ is really $\forall x \in \mathbb{N}$, since our 
axioms exclude anything that isn’t in $\mathbb{N}$), so our table of proof 
techniques suggests using an induction argument, which in this case means 
finding some predicate we can plug into the induction schema. 

If we strip off the $\forall x$, we are left with 

\[ (x \ne 0) \rightarrow (\exists y : x = Sy). \] 

Here a direct proof is suggested: assume $x \ne 0$, and try to prove 
$\exists y : x = Sy$. But our axioms don’t tell us much about numbers that 
aren’t 0, so it’s not clear what to do with the assumption. This turns out to 
be a dead end. 

Recalling that $A \rightarrow B$ is the same thing as $\neg A \vee B$, we can 
rewrite our goal as 

\[ x = 0 \vee \exists y : x = Sy. \] 


³There is a complication here. Peano’s original axioms were formulated in terms of 
second-order logic, which allows quantification over all possible predicates (you can 
write things like $\forall P : P(x) \rightarrow P(Sx)$). So the bogus predicate we defined is implicitly 
included in that for-all. But if there is no first-order predicate that distinguishes bogus 
numbers from legitimate ones, the induction axiom won’t kick them out. This means 
that the Peano axioms (in first-order logic) actually do allow bogus numbers to sneak in 
somewhere around infinity. But they have to be very polite bogus numbers that never 
do anything different from ordinary numbers. This is probably not a problem except for 
philosophers. Similar problems show up for any model with infinitely many elements, due 
to something called the Löwenheim-Skolem theorem. 

This seems like a good candidate for $P$ (our induction hypothesis), because 
we do know a few things about 0. Let’s see what happens if we try plugging 
this into the induction schema: 


• $P(0) \equiv 0 = 0 \vee \exists y : 0 = Sy$. The right-hand term looks false 
because of our first axiom, but the left-hand term is just the reflexive axiom 
for equality. $P(0)$ is true. 


• $\forall x\, P(x) \rightarrow P(Sx)$. We can drop the $\forall x$ if we fix an 
arbitrary $x$. Expand the right-hand side 
$P(Sx) \equiv Sx = 0 \vee \exists y\, Sx = Sy$. We can be pretty 
confident that $Sx \ne 0$ (it’s an axiom), so if this is true, we had better 
show $\exists y\, Sx = Sy$. The first thing to try for $\exists$ statements is 
instantiation: pick a good value for $y$. Picking $y = x$ works. 


Since we showed $P(0)$ and $\forall x\, P(x) \rightarrow P(Sx)$, the induction 
schema tells us $\forall x\, P(x)$. This finishes the proof. 

Having figured the proof out, we might go back and clean up any false 
starts to produce a compact version. A typical mathematician might write 
the preceding argument as: 


Proof. By induction on $x$. For $x = 0$, the premise fails. For $Sx$, let $y = x$. □ 


A really lazy mathematician would write: 


Proof. Induction on x. 


Though laziness is generally a virtue, you probably shouldn’t be quite 
this lazy when writing up homework assignments. 


I.3 Defining addition 


Because of our restricted language, we do not yet have the ability to state 
valuable facts like $1 + 1 = 2$ (which we would have to write as 
$S0 + S0 = SS0$). Let’s fix this, by adding a two-argument function symbol $+$ 
which we will define using the axioms 


• $x + 0 = x$. 


• $x + Sy = S(x + y)$. 


(We are omitting some $\forall$ quantifiers, since unbound variables are 
implicitly universally quantified.) 

This definition is essentially a recursive program for computing x+y using 
only successor, and there are some programming languages (e.g. Haskell) 
that will allow you to define addition using almost exactly this notation. If 
the definition works for all inputs to +, we say that + is well-defined. Not 
working would include giving different answers depending on which parts 
of the definitions we applied first, or giving no answer for some particular 
inputs. These bad outcomes correspond to writing a buggy program. Though 
we can in principle prove that this particular definition is well-defined (using 
induction on y), we won’t bother. Instead, we will try to prove things about 
our new concept of addition that will, among other things, tell us that the 
definition gives the correct answers. 
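
To make the "recursive program" reading concrete, here is a minimal sketch in Python (chosen only for illustration; the notes mention Haskell as a language where the correspondence is even closer). The encoding of numerals as nested tuples and the names `ZERO`, `S`, `add`, and `numeral` are assumptions of this sketch, not part of the axioms.

```python
# Numerals as nested tuples (an encoding assumed for this sketch):
# () stands for 0 and (n,) stands for Sn.
ZERO = ()

def S(n):
    """Successor: wrap n in one more tuple layer."""
    return (n,)

def add(x, y):
    """Addition by recursion on y:  x + 0 = x  and  x + Sy = S(x + y)."""
    if y == ZERO:
        return x
    (pred,) = y                 # y = S(pred)
    return S(add(x, pred))

def numeral(k):
    """Shorthand: build S...S0 with k applications of S."""
    n = ZERO
    for _ in range(k):
        n = S(n)
    return n

print(add(S(ZERO), S(ZERO)) == S(S(ZERO)))   # checks S0 + S0 = SS0
```

Each call to `add` matches exactly one of the two axioms, so the program always gives one answer and terminates after as many steps as there are $S$'s in $y$, which is the informal content of the definition being well-defined.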

We start with a lemma, which is Greek for a result that is not especially 
useful by itself but is handy for proving other results.⁴ 


Lemma I.3.1. $0 + x = x$. 


Proof. By induction on $x$. When $x = 0$, we have $0 + 0 = 0$, which is true 
from the first case of the definition. Now suppose $0 + x = x$ and consider 
what happens with $Sx$. We want to show $0 + Sx = Sx$. Rewrite $0 + Sx$ as 
$S(0 + x)$ [second case of the definition], and use the induction hypothesis to 
show $S(0 + x) = S(x)$. 


(We could do a lot of QED-ish jumping around in the end zone there, 
but it is more refined, and lazier, to leave off the end of the proof once it’s 
clear we’ve satisfied all of our obligations.) 

Here’s another lemma, which looks equally useless: 


Lemma I.3.2. $x + Sy = Sx + y$. 


Proof. By induction on $y$. If $y = 0$, then 
$x + S0 = S(x + 0) = Sx = Sx + 0$. Now suppose the result holds for $y$ and 
show $x + SSy = Sx + Sy$. We have 
$x + SSy = S(x + Sy) = S(Sx + y)$ [ind. hyp.] $= Sx + Sy$. 


Now we can prove a theorem: this is a result that we think of as useful 
in its own right. (In programming terms, it’s something we export from a 
module instead of hiding inside it as an internal procedure.) 


Theorem I.3.3. $x + y = y + x$. (Commutativity of addition.) 


⁴It really means fork. 


Proof. By induction on $x$. If $x = 0$, then $0 + y = y + 0$ (see Lemma I.3.1). 
Now suppose $x + y = y + x$, and we want to show $Sx + y = y + Sx$. But 
$y + Sx = S(y + x)$ [axiom] $= S(x + y)$ [induction hypothesis] 
$= x + Sy$ [axiom] $= Sx + y$ [lemma]. 


This sort of definition-lemma-lemma-theorem structure is typical of 
written mathematical proofs. Breaking things down into small pieces (just like 
breaking big subroutines into small subroutines) makes debugging easier, 
since you can check if some intermediate lemma is true or false without 
having to look through the entire argument at once. 

Question: How do you know which lemmas to prove? Answer: As when 
writing code, you start by trying to prove the main theorem, and whenever 
you come across something you need and can’t prove immediately, you fork it 
off as a lemma. Conclusion: The preceding notes were not originally written 
in order. 


I.3.1 Other useful properties of addition 


So far we have shown that $x + y = y + x$, also known as commutativity 
of addition. Another familiar property is associativity of addition: 
$x + (y + z) = (x + y) + z$. This is easily proven by induction (try it!). 

We don’t have subtraction in $\mathbb{N}$ (what’s $3 - 5$?).⁵ The closest we 
can get is cancellation: 


Lemma I.3.4. $x + y = x + z \rightarrow y = z$. 


⁵This actually came up on a subtraction test I got in the first grade from the terrifying 
Mrs Garrison at Mountain Park Elementary School in Berkeley Heights, New Jersey. She 
insisted that $-2$ was not the correct answer, and that we should have recognized it as a 
trick question. She also made us black out the arrow to the left of the zero on the number-line 
stickers we had all been given to put on the top of our desks. Mrs Garrison was, on the 
whole, a fine teacher, but she did not believe in New Math. 


We can define $\le$ for $\mathbb{N}$ directly from addition: Let 
$x \le y \equiv \exists z\, x + z = y$. 
Then we can easily prove each of the following (possibly using our previous 
results about addition having commutativity, associativity, and cancellation). 


• $x \le y \wedge y \le z \rightarrow x \le z$. 


• $a \le b \wedge c \le d \rightarrow a + c \le b + d$. 


• $x \le y \wedge y \le x \rightarrow x = y$. 


(The actual proofs will be left as an exercise for the reader.) 
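
Before attempting the proofs, it can be reassuring to test the order properties on small numerals. The Python sketch below is a finite sanity check under assumptions of its own: numerals are encoded as nested tuples, and `leq` searches for the witness $z$ only among $0, S0, \ldots, y$, which is enough because a witness can never exceed $y$. None of these names come from the notes, and passing the checks proves nothing about all of $\mathbb{N}$.

```python
ZERO = ()                        # () stands for 0 and (n,) for Sn

def S(n):
    return (n,)

def add(x, y):
    """x + 0 = x;  x + Sy = S(x + y)."""
    return x if y == ZERO else S(add(x, y[0]))

def numeral(k):
    """Build S...S0 with k applications of S."""
    n = ZERO
    for _ in range(k):
        n = S(n)
    return n

def leq(x, y):
    """Decide x <= y by searching for a witness z with x + z = y."""
    z = ZERO
    while True:
        if add(x, z) == y:
            return True          # found the witness
        if z == y:
            return False         # tried every candidate up to y
        z = S(z)

small = [numeral(k) for k in range(6)]
transitive = all(leq(x, z) for x in small for y in small for z in small
                 if leq(x, y) and leq(y, z))
antisymmetric = all(x == y for x in small for y in small
                    if leq(x, y) and leq(y, x))
print(transitive, antisymmetric)
```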


I.4 A scary induction proof involving even numbers 


Let’s define the predicate $\mathrm{Even}(x) \equiv \exists y\, x = y + y$. (The 
use of $\equiv$ here signals that $\mathrm{Even}(x)$ is syntactic sugar, and we 
should think of any occurrence of $\mathrm{Even}(x)$ as expanding to 
$\exists y\, x = y + y$.) 

It’s pretty easy to see that $0 = 0 + 0$ is even. Can we show that $S0$ is not 
even? 


Lemma I.4.1. $\neg\mathrm{Even}(S0)$. 


Proof. Expand the claim as 
$\neg\exists y\, S0 = y + y \equiv \forall y\, S0 \ne y + y$. Since we are 
working over $\mathbb{N}$, it’s tempting to try to prove the $\forall y$ bit 
using induction. But it’s not clear why $S0 \ne y + y$ would tell us anything 
about $S0 \ne Sy + Sy$. So instead we do a case analysis, using our earlier 
observation that every number is either 0 or $Sz$ for some $z$. 


Case 1 $y = 0$. Then $S0 \ne 0 + 0$ since $0 + 0 = 0$ (by the definition of $+$) 
and $0 \ne S0$ (by the first axiom). 


Case 2 $y = Sz$. Then 
$y + y = Sz + Sz = S(Sz + z) = S(z + Sz) = SS(z + z)$.⁶ 
Suppose $S0 = SS(z + z)$ [Note: “Suppose” usually means we are starting 
a proof by contradiction]. Then $0 = S(z + z)$ [second axiom], violating 
$\forall x\, 0 \ne Sx$ [first axiom]. So $S0 \ne SS(z + z) = y + y$. 


Since we have $S0 \ne y + y$ in either case, it follows that $S0$ is not even. 


Maybe we can generalize this lemma! If we recall the pattern of non-even 
numbers we may have learned long ago, each of them (1,3,5,7,...) happens 
to be the successor of some even number (0, 2,4,6,...). So maybe it holds 
that: 


Theorem I.4.2. $\mathrm{Even}(x) \rightarrow \neg\mathrm{Even}(Sx)$. 


Proof. Expanding the definitions gives 
$(\exists y\, x = y + y) \rightarrow (\neg\exists z\, Sx = z + z)$. 
This is an implication at top-level, which calls for a direct proof. The 
assumption we make is $\exists y\, x = y + y$. Let’s pick some particular $y$ 
that makes this true (in fact, there is only one, but we don’t need this). Then 
we can rewrite the right-hand side as $\neg\exists z\, S(y + y) = z + z$. There 
doesn’t seem to be any obvious way to show this (remember that we haven’t 
invented subtraction or division yet, and we probably don’t want to). 


⁶What justifies that middle step? 


We are rescued by showing the stronger statement 
$\forall y\, \neg\exists z\, S(y + y) = z + z$: 
this is something we can prove by induction (on $y$, since that’s the variable 
inside the non-disguised universal quantifier). Our previous lemma gives the 
base case $\neg\exists z\, S(0 + 0) = z + z$, so we just need to show that 
$\neg\exists z\, S(y + y) = z + z$ implies 
$\neg\exists z\, S(Sy + Sy) = z + z$. Suppose that $S(Sy + Sy) = z + z$ for 
some $z$ [“suppose” = proof by contradiction again: we are going to drive this 
assumption into a ditch]. Rewrite $S(Sy + Sy)$ to get $SSS(y + y) = z + z$. 
Now consider two cases: 


Case 1 $z = 0$. Then $SSS(y + y) = 0 + 0 = 0$, contradicting our first axiom. 


Case 2 $z = Sw$. Then $SSS(y + y) = Sw + Sw = SS(w + w)$. Applying the 
second axiom twice gives $S(y + y) = w + w$. But this contradicts the 
induction hypothesis. 


Since both cases fail, our assumption must have been false. It follows that 
S(Sy + Sy) is not even, and the induction goes through. 
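
The theorem can also be spot-checked on small numerals. The Python sketch below assumes a nested-tuple encoding of numerals and made-up names (`even`, `theorem_holds_up_to`); the search for a witness $y$ stops at $x$ because $y + y = x$ forces $y$ to be no larger than $x$. A passing check on finitely many values is only a sanity test and does not replace the induction.

```python
ZERO = ()                        # () stands for 0 and (n,) for Sn

def S(n):
    return (n,)

def add(x, y):
    """x + 0 = x;  x + Sy = S(x + y)."""
    return x if y == ZERO else S(add(x, y[0]))

def numeral(k):
    """Build S...S0 with k applications of S."""
    n = ZERO
    for _ in range(k):
        n = S(n)
    return n

def even(x):
    """Even(x): search y = 0, S0, ..., x for a witness with x = y + y."""
    y = ZERO
    while True:
        if add(y, y) == x:
            return True
        if y == x:
            return False         # no candidate up to x worked
        y = S(y)

def theorem_holds_up_to(limit):
    """Check Even(x) -> not Even(Sx) for the first `limit` numerals."""
    for k in range(limit):
        x = numeral(k)
        if even(x) and even(S(x)):
            return False
    return True

print(even(S(ZERO)), theorem_holds_up_to(20))
```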


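I.5 Defining more operations 


Let’s define multiplication ($\cdot$) by the axioms: 

• $0 \cdot y = 0$. 
• $Sx \cdot y = y + x \cdot y$. 


Some properties of multiplication: 


• $x \cdot 0 = 0$. 
• $1 \cdot x = x$. 
• $x \cdot 1 = x$. 
• $x \cdot y = y \cdot x$. 
• $x \cdot (y \cdot z) = (x \cdot y) \cdot z$. 
• $x \ne 0 \wedge x \cdot y = x \cdot z \rightarrow y = z$. 
• $x \cdot (y + z) = x \cdot y + x \cdot z$. 
• $x \le y \rightarrow z \cdot x \le z \cdot y$. 
• $z \ne 0 \wedge z \cdot x \le z \cdot y \rightarrow x \le y$. 


(Note we are using 1 as an abbreviation for $S0$.) 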

The first few of these are all proved pretty much the same way as for 
addition. Note that we can’t divide in N any more than we can subtract, 
which is why we have to be content with multiplicative cancellation. 
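
As with addition, the two multiplication axioms read directly as a recursive program, this time recursing on the first argument. The Python sketch below again assumes a nested-tuple encoding of numerals; `mul` and the other names are illustrative choices, and the printed checks cover only a few small cases.

```python
ZERO = ()                        # () stands for 0 and (n,) for Sn

def S(n):
    return (n,)

def add(x, y):
    """x + 0 = x;  x + Sy = S(x + y)."""
    return x if y == ZERO else S(add(x, y[0]))

def mul(x, y):
    """0 * y = 0;  Sx * y = y + x * y."""
    if x == ZERO:
        return ZERO
    return add(y, mul(x[0], y))  # x = S(x[0])

def numeral(k):
    """Build S...S0 with k applications of S."""
    n = ZERO
    for _ in range(k):
        n = S(n)
    return n

ONE = S(ZERO)
three, four = numeral(3), numeral(4)
print(mul(three, four) == numeral(12))       # 3 * 4 = 12
print(mul(ONE, four) == four == mul(four, ONE))
```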

Exercise: Show that the $\mathrm{Even}(x)$ predicate, defined previously as 
$\exists y\, x = y + y$, is equivalent to 
$\mathrm{Even}'(x) \equiv \exists y\, x = 2 \cdot y$, where $2 = SS0$. Does this 
definition make it easier or harder to prove $\neg\mathrm{Even}'(S0)$? 